As Cocoon 2.2 doesn't have a crawler anymore, this might be useful for some
people around here.
--
Reinhard Pötz Independent Consultant, Trainer & (IT)-Coach
{Software Engineering, Open Source, Web Applications, Apache Cocoon}
web(log): http://www.poetz.cc
--------------------------------------------------------------------
--- Begin Message ---
Hi all,
I have finished a first version of droids and checked it in.
Please see http://svn.apache.org/repos/asf/labs/droids/README.TXT to get
started.
What is this?
-------------
Droids aims to be an intelligent, standalone crawl framework that
automatically seeks out relevant online information based on the user's
specifications. The core is a simple crawler which can be
easily extended by plugins. So if a project/app needs special processing
for a crawled URL, one can easily write a plugin to implement the
functionality.
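To give an idea, here is a minimal sketch of such a handler-style plugin
(the interface shape here is an assumption for illustration, not the
actual Droids API; see org.apache.droids.handle.Save for the real thing):

// Hedged sketch of a handler-style plugin; the exact interface shape
// is an assumption, not the actual Droids API.
import java.io.InputStream;
import java.net.URL;

interface Handler {
  // called by the core crawler for every fetched resource
  void handle(InputStream stream, URL url) throws Exception;
}

// a project-specific plugin just implements the contract,
// e.g. to print every crawled URL
class PrintHandler implements Handler {
  public void handle(InputStream stream, URL url) {
    System.out.println("crawled: " + url);
  }
}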
Why was it created?
-------------------
Mainly because of personal curiosity:
The background of this work is that Cocoon trunk does not provide a
crawler anymore, and Forrest is based on it, meaning we cannot update
anymore until we find a crawler replacement. Getting more involved in
Solr and Nutch, I also see requests for a generic standalone crawler.
What does the first implementation crawler-x-m02y07 look like?
--------------------------------------------------------------
I took Nutch, ripped out and modified the awesome plugin/extension
framework to create the Droids core.
Now I can implement all functionality in plugins. Droids should make it
very easy to extend.
I wrote some proof-of-concept plugins that make up crawler-x-m02y07 to
- crawl a URL (CrawlerImpl)
- extract links (only <a/> ATM) via a parse-html plugin
- merge them with the queue
- save or print out the crawled pages.
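Put together, the loop behind these plugins looks roughly like the
following sketch (the helper methods are illustrative placeholders
standing in for CrawlerImpl, the parse-html plugin and the handlers,
not the actual Droids classes):

// Rough sketch of the crawl loop; the helper methods are placeholders.
import java.io.InputStream;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {
  void crawl(String seed) throws Exception {
    Queue<String> queue = new LinkedList<String>();
    Set<String> seen = new HashSet<String>();
    queue.add(seed);                            // initial job
    while (!queue.isEmpty()) {
      String uri = queue.poll();
      if (!seen.add(uri)) continue;             // skip known URLs
      URL url = new URL(uri);
      List<String> links = extractLinks(url);   // parse-html plugin
      queue.addAll(links);                      // merge with the queue
      handle(url.openStream(), url);            // save/print handlers
    }
  }
  // placeholders for the real plugin calls
  List<String> extractLinks(URL url) { return new LinkedList<String>(); }
  void handle(InputStream stream, URL url) { }
}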
Why crawler-x-m02y07?
---------------------
Droids tries to be a framework for different droids.
The first implementation is a "crawler" with the name "x",
first archived in the second "m"onth of the "y"ear 20"07".
Rubdabadub has already tested it. :)
Thanks VERY much, I am very impressed: Droids has been on the server
for only 3 hours and has already been tested. You are fast. Thanks.
On Tue, 2007-02-20 at 07:41 +0100, rubdabadub wrote:
> Thanks for the link. I have done a quick test. Here is the report. I
> am on OSX Tiger with JDK 1.5. Also I'd like to say that in the README
> it would be better to add to the requirements section:
> cd droids/lib
> wget ... stax.1.0 etc...
Not sure I understand; we now have:
** If using JDK 1.5:
** cd lib/
** wget http://www.ibiblio.org/maven2/stax/stax-api/1.0/stax-api-1.0.jar
That is relative to the README.
Can you explain what you mean?
>
> Here is the bug...1 (if the file has an .html extension)
>
> [java] INFO: uri http://localhost/AdvUnixProg.html
> [java] Feb 20, 2007 7:35:33 AM org.apache.droids.crawler.Xm02y07
> doTask
> [java] INFO: content type is: text/html
> [java] javax.xml.stream.XMLStreamException
> [java] at
> com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3700)
> [java] at
> com.bea.xml.stream.MXParser.more(MXParser.java:3715)
> [java] at
> com.bea.xml.stream.MXParser.parseProlog(MXParser.java:1968)
> [java] at
> com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1947)
> [java] at
> com.bea.xml.stream.MXParser.next(MXParser.java:1333)
> [java] at
> org.apache.droids.parse.html.HtmlParser.process(HtmlParser.java:65)
> [java] at
> org.apache.droids.parse.html.HtmlParser.getParse(HtmlParser.java:48)
> [java] at
> org.apache.droids.crawler.Xm02y07.doTask(Xm02y07.java:94)
> [java] at
> org.apache.droids.crawler.Xm02y07.crawl(Xm02y07.java:60)
> [java] at org.apache.droids.Cli.main(Cli.java:40)
This bug is described in more detail below.
>
> [java] Feb 20, 2007 7:35:34 AM org.apache.droids.crawler.Xm02y07
> doTask
> [java] WARNING:
> [java] java.lang.NullPointerException
> [java] at
> org.apache.droids.crawler.Xm02y07.doTask(Xm02y07.java:96)
> [java] at
> org.apache.droids.crawler.Xm02y07.crawl(Xm02y07.java:60)
> [java] at org.apache.droids.Cli.main(Cli.java:40)
> [java] java.io.IOException: Not a directory
> [java] at java.io.UnixFileSystem.createFileExclusively(Native
> Method)
> [java] at java.io.File.createNewFile(File.java:850)
> [java] at
> org.apache.droids.handle.Save.createFile(Save.java:52)
// if we cannot create a file, that means the parent path
// does not exist
File path = new File(cache.getParent());
path.mkdirs();
cache.createNewFile();
This code in Save.createFile() is failing. Hmm, I need to look into this.
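My guess (an assumption, not yet verified against the code): a component
of cache.getParent() already exists as a regular file - e.g. a page was
saved as .../AdvUnixProg.html and a child URL now needs that path as a
directory - so mkdirs() silently returns false and createNewFile()
throws "Not a directory". A defensive variant of the snippet above:

// Defensive sketch; the failure analysis is an assumption. Checking
// the mkdirs() result surfaces the real problem instead of letting
// createNewFile() fail with "Not a directory".
File path = new File(cache.getParent());
if (!path.isDirectory() && !path.mkdirs()) {
  throw new IOException("cannot create directory " + path
      + " (a path component may already exist as a plain file)");
}
cache.createNewFile();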
> [java] at
> org.apache.droids.handle.Save.writeOutput(Save.java:38)
> [java] at org.apache.droids.handle.Save.handle(Save.java:27)
> [java] at
> org.apache.droids.factory.HandlerFactory.handle(HandlerFactory.java:60)
> [java] at
> org.apache.droids.crawler.Xm02y07.doTask(Xm02y07.java:114)
> [java] at
> org.apache.droids.crawler.Xm02y07.crawl(Xm02y07.java:60)
> [java] at org.apache.droids.Cli.main(Cli.java:40)
> [java] Feb 20, 2007 7:35:34 AM org.apache.droids.crawler.Xm02y07
> crawl
> [java] INFO: crawler-x-m02y07 reports:
> [java] Finished initial job
> http://www.raditex.se/utb/kurs/AdvUnixProg.html
> [java] Feb 20, 2007 7:35:34 AM org.apache.droids.crawler.Xm02y07
> crawl
> [java] INFO: Crawled a total of 1
>
> I wonder if I am missing something...
You are missing nothing. The parse-html plugin is very basic ATM and
assumes that the incoming stream is valid XML!
You can try http://www.target-x.de/search.html BUT please stop the
crawler after a couple of pages. You will see that it will crawl more
than one page.
// extract links in Xm02y07
parse = parser.getParse(protocol.openStream(uri), new URL(uri));
Here parse is null, making the next line fail, because the incoming
stream may be malformed or faulty HTML.
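Until the parser is smarter, a simple guard would at least keep the
droid alive (sketch; the 'log' field is assumed to be the
java.util.logging logger seen in the output above):

// Sketch of a null guard after getParse(); 'log' is assumed.
parse = parser.getParse(protocol.openStream(uri), new URL(uri));
if (parse == null) {
  log.warning("could not parse " + uri + ", skipping link extraction");
  return;              // move on instead of dying with an NPE
}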
I took a shortcut with this plugin because I tested with a Forrest
Dispatcher site which returned valid XHTML, and used StAX to extract the
links. It should not be hard to enhance the parse-html plugin, though.
A quick fix is to use JTidy to clean up malformed and faulty HTML in
parser.getParse(...) and use the result for extracting the links.
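A sketch of that quick fix (the surrounding class and method are
hypothetical; only the org.w3c.tidy.Tidy calls are standard JTidy API):

// Sketch: run the raw stream through JTidy so the StAX-based link
// extraction always sees well-formed XHTML.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.w3c.tidy.Tidy;

class TidyFilter {
  InputStream tidy(InputStream raw) {
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);          // emit well-formed XHTML
    tidy.setQuiet(true);          // keep the crawler log readable
    tidy.setShowWarnings(false);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    tidy.parse(raw, out);         // parse tag soup, write clean markup
    return new ByteArrayInputStream(out.toByteArray());
  }
}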
Before that, I still need to write a droids factory, so that one can
write another implementation than Xm02y07 as a crawler plugin and invoke
it via the Cli. Another todo is to use a dependency system like Apache
Ivy instead of copycatting the Nutch approach.
Further, I would like to discuss the general design taken and the
interfaces used.
What is missing?
What should be more general, and what more specific?
Things like:
// pass the stream to the handlers for further processing
- handlersFactory.handle(protocol.openStream(uri), new URL(uri));
+ handlersFactory.handle(protocol.openStream(uri), new URL(uri), parse);
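i.e. the handler contract would grow a third argument, roughly (sketch;
the exact interface shape and the Parse type are assumptions based on
the diff above):

// Sketch of the changed contract; Parse stands for whatever
// parser.getParse(...) returns.
void handle(InputStream stream, URL url, Parse parse) throws Exception;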
How can we write a Nutch crawler plugin to utilize native Nutch plugins
and imitate the Nutch crawler in Droids? Is Nutch interested in such a
plugin?
Please test and report feedback to [EMAIL PROTECTED]. I will happily
answer all mails there.
salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
--- End Message ---