Hello,
I started crawling a huge number of websites (DMOZ, with no limits in
crawl-urlfilter.txt) with -depth 10 and -topN 1 mln.
My /tmp/hadoop-root/ is already more than 18GB (map-reduce job data).
This fetching will not stop any time soon :) so I would like to process
the segments that are already finished (updatedb, invertlinks, index),
but parts are missing from them:
[r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
crawls/segments/20090216142840/
LinkDb: adding segment:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
...
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
etc.
When I manually try to bin/nutch parse the segments, it says they are already parsed.
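If it helps to narrow this down: as far as I understand, a fully processed Nutch segment should contain crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text subdirectories, while a segment that was only generated (never fetched/parsed) has crawl_generate alone. A small sketch to check which parts a segment actually has (the function name and the list of subdirectories are my assumption, based on how my segments look on disk):

```shell
# Hedged sketch: report which of the expected Nutch segment
# subdirectories are present in a given segment directory.
check_segment() {
  seg="$1"
  for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
    if [ -d "$seg/$part" ]; then
      echo "$part: present"
    else
      echo "$part: MISSING"
    fi
  done
}
```

Running this on one of my segments (e.g. `check_segment crawls/segments/20090216142840`) would show exactly which steps never ran on it, so invertlinks is presumably failing because it only sees crawl_generate.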
So my question is: how should I design the whole process of crawling a
large number of websites without limiting them to specific domains (like
a regular search engine, e.g. Google)?
Should I run loops over small batches of links? Like -topN 1000, and then
updatedb, invertlinks, index?
As it is now, I can start crawling, but no data will appear for weeks.
I found that live indexing is being introduced in Nutch 1.0 (so it is
already done?). Are there any docs I can use?
Regards,
Bartosz Gadzimski