Doğacan Güney writes:
Hi,


On Thu, Feb 19, 2009 at 13:28, Bartek <[email protected]> wrote:
Hello,

I started to crawl a huge number of websites (DMOZ, with no limits in
crawl-urlfilter.txt) with -depth 10 and -topN 1 million.

My /tmp/hadoop-root/ is already more than 18 GB (map-reduce job data).


This fetching will not finish soon :) so I would like to process the segments
already fetched (updatedb, invertlinks, index), but some parts are missing from them:

[r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
crawls/segments/20090216142840/



If you use the -dir option, you pass the segments directory, not individual
segments, e.g.:

bin/nutch invertlinks crawls/linkdb -dir crawls/segments

which will read every directory under crawls/segments.

To pass individual segments, skip the -dir option:

bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840

Thanks a lot!

It's working, but it's a bit strange:

bin/nutch invertlinks crawls/linkdb -dir crawls/segments
does not work (the same error as in my previous message)

bin/nutch invertlinks crawls/linkdb crawls/segments/2009*
works correctly


LinkDb: adding segment:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate

...

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data

etc.

When I manually try to run bin/nutch parse on the segments, it says they are
already parsed.


So my question is: how should one design the whole process of crawling a large
number of websites without limiting the crawl to specific domains (like a
regular search engine, e.g. Google)?

Should I loop over small batches of links? For example -topN 1000, then
updatedb, invertlinks, index?
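For reference, a loop like that could look roughly like this. This is a minimal sketch, not a tested pipeline: the paths (crawls/crawldb, crawls/segments, crawls/linkdb, crawls/indexes) and the round count are example values, and the commands follow the standard bin/nutch command line (generate, fetch, updatedb, invertlinks, index) as in Nutch 0.9/1.0:

```shell
#!/bin/sh
# Iterative crawl: many small -topN rounds instead of one huge fetch,
# so the crawldb/linkdb/index can be refreshed between rounds.
# Paths and the round count below are examples; adjust to your layout.
CRAWL=crawls
ROUNDS=3

# Fall back to 'echo' for a dry run on a machine without Nutch installed.
if [ -x bin/nutch ]; then NUTCH=bin/nutch; else NUTCH=echo; fi

i=1
while [ "$i" -le "$ROUNDS" ]; do
  # Generate a small fetchlist: the next 1000 top-scoring URLs.
  $NUTCH generate $CRAWL/crawldb $CRAWL/segments -topN 1000
  # The newest directory under segments/ is the one just generated.
  segment=$(ls -d $CRAWL/segments/* 2>/dev/null | tail -1)
  $NUTCH fetch "$segment"
  # Fold the fetch results back into the crawldb before the next round.
  $NUTCH updatedb $CRAWL/crawldb "$segment"
  i=$((i + 1))
done

# After the rounds: invert links over all segments, then index.
$NUTCH invertlinks $CRAWL/linkdb -dir $CRAWL/segments
$NUTCH index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*
```

With small rounds like this, each segment is fully fetched and merged back quickly, so invertlinks and index always see complete segments instead of half-fetched ones.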


As it stands, I can start crawling now, but no searchable data will appear for weeks.

I found that in 1.0 (already done, then) you are introducing live indexing in
Nutch. Are there any docs I can use?

Regards,
Bartosz Gadzimski
