Hello,
I started crawling a huge number of websites (DMOZ, with no limits in
crawl-urlfilter.txt) with -depth 10 and -topN 1 mln.
My /tmp/hadoop-root/ is already more than 18GB (map-reduce job data).
This fetching will not stop any time soon :) so I would like to process
the segments that are already finished (updatedb, invertlinks, index),
but parts are missing from them:
[r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
crawls/segments/20090216142840/
LinkDb: adding segment:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
...
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
etc.
When I manually try to bin/nutch parse the segments, it says they are already parsed.
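If it helps to narrow this down: as far as I understand, a fully processed Nutch segment should contain crawl_generate, crawl_fetch, content, crawl_parse, parse_data and parse_text subdirectories, while a segment that was only generated (never fetched/parsed) has crawl_generate alone. A small sketch to check which parts a segment actually has (the function name and the list of subdirectories are my assumption, based on how my segments look on disk):

```shell
# Hedged sketch: report which of the expected Nutch segment
# subdirectories are present in a given segment directory.
check_segment() {
  seg="$1"
  for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
    if [ -d "$seg/$part" ]; then
      echo "$part: present"
    else
      echo "$part: MISSING"
    fi
  done
}
```

Running this on one of my segments (e.g. `check_segment crawls/segments/20090216142840`) would show exactly which steps never ran on it, so invertlinks is presumably failing because it only sees crawl_generate.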
So my question is: how should I design the whole process of crawling a
large number of websites without limiting them to specific domains (like
a regular search engine, e.g. Google)?
Should I run loops over small batches of links? Like -topN 1000, and then
updatedb, invertlinks, index?
As it is now, I can start crawling, but no data will appear for weeks.
I found that live indexing is being introduced in Nutch 1.0 (so it is
already done?). Are there any docs I can use?
Regards,
Bartosz Gadzimski