Re: crawl only option for Crawl.java and crawled content reader class

jian chen Mon, 26 Nov 2007 13:37:28 -0800

Hi, Isabel,

I think a lot of the open source Java crawlers are pretty much dead
projects. They haven't been updated for a long time.

I am on the fence about releasing my own pure crawler as open
source. Is anyone interested in using it?

My crawler differs from Nutch mainly in the following ways:

1) Uses MySQL as the backend for storing each crawled URL along with
its raw content and metadata.

2) Per-site configuration. You can schedule the crawler to crawl
different sites at different times and index them separately on a
site-by-site basis.

3) Runs in Eclipse directly. No need to install Cygwin.

4) A much simpler code base than Nutch, so you can tweak it easily.
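
As a rough illustration of point 2 only: the sketch below is not code from
the crawler itself, and every class, field, and method name in it
(SiteConfig, recrawlInterval, indexName, dueSites) is hypothetical. It just
shows one way per-site settings and a "which sites are due for a crawl"
check could be modeled in Java.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerSiteCrawlSketch {

    /** Hypothetical per-site settings; not taken from the actual crawler. */
    static final class SiteConfig {
        final String seedUrl;           // where the crawl of this site starts
        final Duration recrawlInterval; // how often this site should be crawled
        final String indexName;         // each site is indexed separately
        Instant lastCrawled;            // null until the first crawl finishes

        SiteConfig(String seedUrl, Duration recrawlInterval, String indexName) {
            this.seedUrl = seedUrl;
            this.recrawlInterval = recrawlInterval;
            this.indexName = indexName;
        }

        /** A site is due if it was never crawled or its interval has elapsed. */
        boolean isDue(Instant now) {
            return lastCrawled == null
                    || !now.isBefore(lastCrawled.plus(recrawlInterval));
        }
    }

    /** Returns the sites whose crawl should start at the given time. */
    static List<SiteConfig> dueSites(List<SiteConfig> sites, Instant now) {
        List<SiteConfig> due = new ArrayList<>();
        for (SiteConfig site : sites) {
            if (site.isDue(now)) {
                due.add(site);
            }
        }
        return due;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();

        // Example sites with different schedules (placeholder URLs).
        SiteConfig news = new SiteConfig(
                "http://news.example.com/", Duration.ofHours(6), "news");
        SiteConfig docs = new SiteConfig(
                "http://docs.example.com/", Duration.ofDays(7), "docs");
        docs.lastCrawled = now.minus(Duration.ofDays(1)); // crawled yesterday

        for (SiteConfig site : dueSites(Arrays.asList(news, docs), now)) {
            System.out.println("due: " + site.seedUrl
                    + " -> index " + site.indexName);
        }
    }
}
```

In a MySQL-backed setup like the one in point 1, these settings would
presumably live in a table rather than in memory, but the due check would
work the same way.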

I studied Nutch 0.7, and the architecture of this crawler is largely
based on that version.

Anyone interested in using it?

Cheers,

Jian

On Nov 26, 2007 12:57 PM, Isabel Drost <[EMAIL PROTECTED]> wrote:
> On Saturday 24 November 2007, jian chen wrote:
> > But, since maintaining my own crawler almost becomes a full-time job
> > itself, plus, it lacks a lot of features compared to Nutch latest
> > version, so, I am kind of motivated to come back to Nutch and thinking
> > about using Nutch instead of my own.
>
> I do not know exactly how they compare against the nutch crawler, but there
> are a few Java open source crawlers* out there. Is there any specific reason,
> why you would prefer to use the nutch crawler over those?
>
> Isabel
>
> * http://java-source.net/open-source/crawlers
>
>
> --
> He that teaches himself has a fool for a master.-- Benjamin
> Franklin
>  |\ _,,,---,,_ Web: <http://www.isabel-drost.de>
>  /,`.-'`' -. ;-;;,_
> |,4- ) )-,_..;\ ( `'-'
> '---''(_/--' `-'\_) (fL) IM: <xmpp://[EMAIL PROTECTED]>
>
