As Cocoon 2.2 doesn't have a crawler anymore, this might be useful for some
people around here.
--
Reinhard Pötz Independent Consultant, Trainer & (IT)-Coach
{Software Engineering, Open Source, Web Applications, Apache Cocoon}
web(log): http://www.poetz.cc
--------------------------------------------------------------------
--- Begin Message ---
Hi all,
I have finished a first version of droids and checked it in.
Please see http://svn.apache.org/repos/asf/labs/droids/README.TXT to get
started.
What is this?
-------------
Droids aims to be an intelligent, standalone crawl framework that
automatically seeks out relevant online information based on the user's
specifications. The core is a simple crawler which can be
easily extended by plugins. So if a project/app needs special processing
for a crawled URL, one can easily write a plugin to implement the
functionality.
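To give an idea, here is a minimal sketch of such a handler-style plugin
(the interface shape here is an assumption for illustration, not the
actual Droids API; see org.apache.droids.handle.Save for the real thing):

// Hedged sketch of a handler-style plugin; the exact interface shape
// is an assumption, not the actual Droids API.
import java.io.InputStream;
import java.net.URL;

interface Handler {
  // called by the core crawler for every fetched resource
  void handle(InputStream stream, URL url) throws Exception;
}

// a project-specific plugin just implements the contract,
// e.g. to print every crawled URL
class PrintHandler implements Handler {
  public void handle(InputStream stream, URL url) {
    System.out.println("crawled: " + url);
  }
}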
Why was it created?
-------------------
Mainly because of personal curiosity:
The background of this work is that Cocoon trunk does not provide a
crawler anymore, and Forrest is based on it, meaning we cannot update
anymore until we find a crawler replacement. Getting more involved in
Solr and Nutch, I also see requests for a generic standalone crawler.
What does the first implementation crawler-x-m02y07 look like?
--------------------------------------------------------------
I took Nutch, ripped out and modified the awesome plugin/extension
framework to create the Droids core.
Now I can implement all functionality in plugins. Droids should make it
very easy to extend.
I wrote some proof-of-concept plugins that make up crawler-x-m02y07 to
- crawl a URL (CrawlerImpl)
- extract links (only <a/> ATM) via a parse-html plugin
- merge them with the queue
- save or print out the crawled pages.
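Put together, the loop behind these plugins looks roughly like the
following sketch (the helper methods are illustrative placeholders
standing in for CrawlerImpl, the parse-html plugin and the handlers,
not the actual Droids classes):

// Rough sketch of the crawl loop; the helper methods are placeholders.
import java.io.InputStream;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {
  void crawl(String seed) throws Exception {
    Queue<String> queue = new LinkedList<String>();
    Set<String> seen = new HashSet<String>();
    queue.add(seed);                            // initial job
    while (!queue.isEmpty()) {
      String uri = queue.poll();
      if (!seen.add(uri)) continue;             // skip known URLs
      URL url = new URL(uri);
      List<String> links = extractLinks(url);   // parse-html plugin
      queue.addAll(links);                      // merge with the queue
      handle(url.openStream(), url);            // save/print handlers
    }
  }
  // placeholders for the real plugin calls
  List<String> extractLinks(URL url) { return new LinkedList<String>(); }
  void handle(InputStream stream, URL url) { }
}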
Why crawler-x-m02y07?
---------------------
Droids tries to be a framework for different droids.
The first implementation is a "crawler" with the name "x",
first archived in the second "m"onth of the "y"ear 20"07".
Rubdabadub has already tested it. :)
Thanks VERY much, I am very impressed: Droids has been on the server
for only 3 hours and has already been tested. You are fast. Thanks.
On Tue, 2007-02-20 at 07:41 +0100, rubdabadub wrote:
> Thanks for the link. I have done a quick test. Here is the report. I
> am on OSX Tiger with JDK 1.5. Also I'd like to say that in the README
> it would be better to add to the requirements section:
> cd droids/lib
> wget ... stax.1.0 etc...
Not sure I understand; we now have:
** If using JDK 1.5:
** cd lib/
** wget http://www.ibiblio.org/maven2/stax/stax-api/1.0/stax-api-1.0.jar
That is relative to the README.
Can you explain what you mean?
>
> Here is the bug...1 (if the file has an .html extension)
>
> [java] INFO: uri http://localhost/AdvUnixProg.html
> [java] Feb 20, 2007 7:35:33 AM org.apache.droids.crawler.Xm02y07
> doTask
> [java] INFO: content type is: text/html
> [java] javax.xml.stream.XMLStreamException
> [java] at
> com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3700)
> [java] at
> com.bea.xml.stream.MXParser.more(MXParser.java:3715)
> [java] at
> com.bea.xml.stream.MXParser.parseProlog(MXParser.java:1968)
> [java] at
> com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1947)
> [java] at
> com.bea.xml.stream.MXParser.next(MXParser.java:1333)
> [java] at
> org.apache.droids.parse.html.HtmlParser.process(HtmlParser.java:65)
> [java] at
> org.apache.droids.parse.html.HtmlParser.getParse(HtmlParser.java:48)
> [java] at
> org.apache.droids.crawler.Xm02y07.doTask(Xm02y07.java:94)
> [java] at
> org.apache.droids.crawler.Xm02y07.crawl(Xm02y07.java:60)
> [java] at org.apache.droids.Cli.main(Cli.java:40)
This bug is described in more detail below.
>
> [java] Feb 20, 2007 7:35:34 AM org.apache.droids.crawler.Xm02y07
> doTask
> [java] WARNING:
> [java] java.lang.NullPointerException
> [java] at
> org.apache.droids.crawler.Xm02y07.doTask(Xm02y07.java:96)
> [java] at
> org.apache.droids.crawler.Xm02y07.crawl(Xm02y07.java:60)
> [java] at org.apache.droids.Cli.main(Cli.java:40)
> [java] java.io.IOException: Not a directory
> [java] at java.io.UnixFileSystem.createFileExclusively(Native
> Method)
> [java] at java.io.File.createNewFile(File.java:850)
> [java] at
> org.apache.droids.handle.Save.createFile(Save.java:52)
// if we cannot create a file, that means the parent path
// does not exist
File path = new File(cache.getParent());
path.mkdirs();
cache.createNewFile();
This code in Save.createFile() is failing. Hmm, I need to look into this.
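My guess (an assumption, not yet verified against the code): a component
of cache.getParent() already exists as a regular file - e.g. a page was
saved as .../AdvUnixProg.html and a child URL now needs that path as a
directory - so mkdirs() silently returns false and createNewFile()
throws "Not a directory". A defensive variant of the snippet above:

// Defensive sketch; the failure analysis is an assumption. Checking
// the mkdirs() result surfaces the real problem instead of letting
// createNewFile() fail with "Not a directory".
File path = new File(cache.getParent());
if (!path.isDirectory() && !path.mkdirs()) {
  throw new IOException("cannot create directory " + path
      + " (a path component may already exist as a plain file)");
}
cache.createNewFile();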
> [java] at
> org.apache.droids.handle.Save.writeOutput(Save.java:38)
> [java] at org.apache.droids.handle.Save.handle(Save.java:27)
> [java] at
> org.apache.droids.factory.HandlerFactory.handle(HandlerFactory.java:60)
> [java] at
> org.apache.droids.crawler.Xm02y07.doTask(Xm02y07.java:114)
> [java] at
> org.apache.droids.crawler.Xm02y07.crawl(Xm02y07.java:60)
> [java] at org.apache.droids.Cli.main(Cli.java:40)
> [java] Feb 20, 2007 7:35:34 AM org.apache.droids.crawler.Xm02y07
> crawl
> [java] INFO: crawler-x-m02y07 reports:
> [java] Finished initial job
> http://www.raditex.se/utb/kurs/AdvUnixProg.html
> [java] Feb 20, 2007 7:35:34 AM org.apache.droids.crawler.Xm02y07
> crawl
> [java] INFO: Crawled a total of 1
>
> I wonder if I am missing something...
You are missing nothing. The parse-html plugin is very basic ATM and
assumes that the incoming stream is valid XML!
You can try http://www.target-x.de/search.html BUT please stop the
crawler after a couple of pages. You will see that it will crawl more
than one page.
// extract links in Xm02y07
parse = parser.getParse(protocol.openStream(uri), new URL(uri));
Here parse is null, making the next line fail, because the incoming
stream may be malformed or faulty HTML.
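Until the parser is smarter, a simple guard would at least keep the
droid alive (sketch; the 'log' field is assumed to be the
java.util.logging logger seen in the output above):

// Sketch of a null guard after getParse(); 'log' is assumed.
parse = parser.getParse(protocol.openStream(uri), new URL(uri));
if (parse == null) {
  log.warning("could not parse " + uri + ", skipping link extraction");
  return;              // move on instead of dying with an NPE
}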
I took a shortcut with this plugin because I tested with a Forrest
Dispatcher site which returned valid XHTML, and used StAX to extract the
links. It should not be hard to enhance the parse-html plugin, though.
A quick fix is to use JTidy to clean up malformed and faulty HTML in
parser.getParse(...) and use the result for extracting the links.
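A sketch of that quick fix (the surrounding class and method are
hypothetical; only the org.w3c.tidy.Tidy calls are standard JTidy API):

// Sketch: run the raw stream through JTidy so the StAX-based link
// extraction always sees well-formed XHTML.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.w3c.tidy.Tidy;

class TidyFilter {
  InputStream tidy(InputStream raw) {
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);          // emit well-formed XHTML
    tidy.setQuiet(true);          // keep the crawler log readable
    tidy.setShowWarnings(false);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    tidy.parse(raw, out);         // parse tag soup, write clean markup
    return new ByteArrayInputStream(out.toByteArray());
  }
}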
Before that, I still need to write a droids factory, so that one can
write another implementation than Xm02y07 as a crawler plugin and invoke
it via the Cli. Another todo is to use a dependency system like Apache
Ivy instead of copycatting the Nutch approach.
Further, I would like to discuss the general design taken and the
interfaces used.
What is missing?
What should be more general, and what more specific?
Things like:
// pass the stream to the handlers for further processing
- handlersFactory.handle(protocol.openStream(uri), new URL(uri));
+ handlersFactory.handle(protocol.openStream(uri), new URL(uri), parse);
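i.e. the handler contract would grow a third argument, roughly (sketch;
the exact interface shape and the Parse type are assumptions based on
the diff above):

// Sketch of the changed contract; Parse stands for whatever
// parser.getParse(...) returns.
void handle(InputStream stream, URL url, Parse parse) throws Exception;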
How can we write a Nutch crawler plugin to utilize native Nutch plugins
and imitate the Nutch crawler in Droids? Is Nutch interested in such a
plugin?
Please test and report feedback to [EMAIL PROTECTED]. I will happily
answer all mails there.
salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
--- End Message ---