Hi, Cool Coder,

I also wrote a crawler based on a previous version of Nutch, version
0.7, for intranet crawling. In my crawler, I used a MySQL database as
the backend storage for URLs and their metadata and content.

But since maintaining my own crawler has almost become a full-time
job in itself, and it lacks a lot of features compared to the latest
Nutch version, I am motivated to come back to Nutch and am thinking
about using it instead of my own.

However, I have a different need for indexing, so I don't think I
will use the Nutch indexing and search code; I would rather stick to
my own. Like you, I use a customized version of Lucene to index and
provide search.

I just hope to make Nutch easier to use for these kinds of crawl-only scenarios.

I am willing to contribute code/documentation in this regard, if necessary.

Cheers,

Jian

On Nov 23, 2007 5:51 PM, Cool Coder <[EMAIL PROTECTED]> wrote:
> Definitely +1 from me.
>   In fact, I am already looking for a crawler to replace my own junk :(
> crawler. I am currently using my own crawler and Lucene.
>
>   - RB
>
>
> jian chen <[EMAIL PROTECTED]> wrote:
>   Hi, Nutch Developers,
>
> We would like to use Nutch as a crawler but do not need the indexes.
> Looking at Crawl.java, which is used for intranet crawls, it is easy
> to just comment out the index, dedup and merge code.
>
> I see in this mailing list that a lot of people want to use Nutch
> only for crawling, so how about providing a command-line switch in
> Crawl.java to support crawl-only mode?
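>
> Just to make the idea concrete, here is a rough, untested sketch of
> what the guard in Crawl.java's main() could look like. The "-noIndex"
> flag name is only my suggestion, not an existing Nutch option:
>
>     // proposed crawl-only switch: skip the post-crawl indexing steps
>     boolean indexAfterCrawl = true;
>     for (int i = 0; i < args.length; i++) {
>       if ("-noIndex".equals(args[i])) {
>         indexAfterCrawl = false;
>       }
>     }
>
>     // ... the inject/generate/fetch/updatedb loop runs as before ...
>
>     if (indexAfterCrawl) {
>       // the existing index, dedup and merge calls from Crawl.java
>     }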
>
> Another semi-related proposal. Since Nutch will be used only for
> crawling, we need to extract the data out of Nutch after the crawl.
> This is also a recurring question on this mailing list. After doing
> quite some research, I found that there are a couple of utility
> classes that could be used to extract the content out of the crawled
> database, for example, Hadoop's SequenceFile.Reader and/or
> MapFile.Reader.
>
> So, how about providing a simple class that will just dump the data
> out? I know that SegmentReader and/or CrawlDbReader could be used,
> but they are sort of "heavy", as they run as Map/Reduce jobs. In
> many cases, I believe people just need to dump the crawl data from
> the local file system into a text file or a database, so a tool that
> does not use Map/Reduce might be better.
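>
> For example, a minimal, untested sketch of such a dumper, reading a
> segment's content data file directly with SequenceFile.Reader. The
> path layout "content/part-00000/data" is what a single-machine fetch
> produces, and the class names are from Nutch 0.9 and the Hadoop it
> bundles, so take this as an illustration rather than a patch:
>
>     import java.io.IOException;
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>     import org.apache.nutch.protocol.Content;
>     import org.apache.nutch.util.NutchConfiguration;
>
>     /** Dumps URL, content type and size for every record in a
>      *  segment content file, without running a Map/Reduce job. */
>     public class ContentDumper {
>       public static void main(String[] args) throws IOException {
>         Configuration conf = NutchConfiguration.create();
>         FileSystem fs = FileSystem.get(conf);
>         // e.g. crawl/segments/20071123170000/content/part-00000/data
>         Path data = new Path(args[0]);
>         SequenceFile.Reader reader =
>             new SequenceFile.Reader(fs, data, conf);
>         Text url = new Text();           // key: the page URL
>         Content content = new Content(); // value: the fetched content
>         while (reader.next(url, content)) {
>           System.out.println(url + "\t" + content.getContentType()
>               + "\t" + content.getContent().length + " bytes");
>         }
>         reader.close();
>       }
>     }
>
> The same approach should work for the crawldb (Text/CrawlDatum) and
> for parsed text (Text/ParseText), just by swapping the value class.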
>
> I know Nutch is designed for internet crawls, but there does exist a
> need to crawl an intranet and easily manipulate the data, for which
> one machine is good enough.
>
> Thanks,
>
> Jian
>
>
>
