Hello,

I've started crawling a huge number of websites (the dmoz seed list, with no limits in crawl-urlfilter.txt) with -depth 10 and -topN 1000000.
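For reference, the invocation looks roughly like this (a sketch from memory; the `urls` and `crawls` paths are assumptions):

```shell
# One-shot crawl tool: dmoz seeds in ./urls, output under ./crawls
# (paths are placeholders, not exact)
CRAWL_CMD="bin/nutch crawl urls -dir crawls -depth 10 -topN 1000000"
echo "$CRAWL_CMD"
```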

My /tmp/hadoop-root/ directory (map-reduce job data) is already over 18 GB.


This fetch will not finish any time soon :) so I would like to process the segments that are already complete (updatedb, invertlinks, index), but some parts are missing from them:

[r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir crawls/segments/20090216142840/


LinkDb: adding segment: file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate

...

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data

etc.

When I manually try to bin/nutch parse the segments, it says they are already parsed.
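To see which segments are actually complete, I've been checking their contents with something like this (a hypothetical helper; the segment path is an assumption):

```shell
#!/bin/sh
# List the parts present in each segment directory, so fully-parsed
# segments (with parse_data / parse_text) stand out from fetch-only ones.
list_segment_parts() {
  for seg in "$1"/*/; do
    [ -d "$seg" ] || continue
    echo "${seg%/} -> $(ls "$seg" | tr '\n' ' ')"
  done
}

# Example run against my (assumed) layout:
list_segment_parts crawls/segments
```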


So my question is: how should I design the whole process of crawling a large number of websites without restricting them to specific domains (like a general search engine, e.g. Google)?

Should I run loops over small batches of links, e.g. -topN 1000 followed by updatedb, invertlinks, and index?
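What I have in mind is something like the following dry-run sketch (commands are echoed, not executed; paths, round count, and -topN values are assumptions, not tested settings):

```shell
#!/bin/sh
# Batched crawl loop: generate/fetch/updatedb in small -topN rounds,
# then invertlinks and index once over all segments at the end.
NUTCH="bin/nutch"
CRAWLDB="crawls/crawldb"
LINKDB="crawls/linkdb"
SEGMENTS="crawls/segments"
ROUNDS=3      # how many small rounds per pass (placeholder)
TOPN=1000     # small batch instead of -topN 1000000

PLAN=""
run() { echo "$@"; PLAN="$PLAN$*;"; }   # swap body for "$@" to execute

i=1
while [ "$i" -le "$ROUNDS" ]; do
  run $NUTCH generate $CRAWLDB $SEGMENTS -topN $TOPN
  # the segment to fetch is the one generate just created (newest dir):
  SEG="$SEGMENTS/$(ls "$SEGMENTS" 2>/dev/null | sort | tail -1)"
  run $NUTCH fetch "$SEG"
  run $NUTCH updatedb $CRAWLDB "$SEG"
  i=$((i + 1))
done
run $NUTCH invertlinks $LINKDB -dir $SEGMENTS
run $NUTCH index crawls/indexes $CRAWLDB $LINKDB $SEGMENTS/*
```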


As things stand, I can start crawling, but no data will appear for weeks.

I also found that 1.0 (already in the works) introduces live indexing in Nutch. Are there any docs I can use?

Regards,
Bartosz Gadzimski


