Hi,
On Thu, Feb 19, 2009 at 13:28, Bartek <[email protected]> wrote:
> Hello,
>
> I started to crawl a huge amount of websites (dmoz with no limits in
> crawl-urlfilter.txt) with -depth 10 and -topN 1 mln
>
> My /tmp/hadoop-root/ is more than 18GB for now (map-reduce jobs)
>
> This fetching will not stop soon :) so I would like to convert already made
> segments (updatedb, invertlinks, index) but there are parts missing in them:
>
> [r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
> crawls/segments/20090216142840/

If you use the -dir option then you pass the segments directory, not an
individual segment, e.g.:

  bin/nutch invertlinks crawls/linkdb -dir crawls/segments

which will read every directory under segments. To pass individual
segments, skip the -dir option:

  bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840

> LinkDb: adding segment:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>
> ...
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>
> etc.
>
> When manually trying to bin/nutch parse the segments, it says that they are
> already parsed.
>
> So my question is: how do I design the whole process of crawling a large
> amount of websites without limiting them to specific domains (like in a
> regular search engine, e.g. Google)?
>
> Should I make loops of small amounts of links? Like -topN 1000 and then
> updatedb, invertlinks, index?
>
> For now I can start crawling, and data will appear in weeks.
>
> I found that in 1.0 (so made already) you are introducing live indexing in
> Nutch. Are there any docs that I can use?
>
> Regards,
> Bartosz Gadzimski

--
Doğacan Güney
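The batched loop Bartek asks about (small -topN rounds, then updatedb,
invertlinks, index) can be sketched roughly as below. This is only an
illustration, not from the thread: the paths, the round count, and the
placeholder segment name are assumptions, and it assumes Nutch 0.9/1.0-era
command names.

```shell
# Sketch of a batched crawl loop: fetch in small -topN rounds, then build
# the link database and index over the accumulated segments.

# run() just prints each command; replace "echo" with a real invocation of
# bin/nutch once the paths match your installation.
run() { echo "bin/nutch $*"; }

for round in 1 2 3; do
  run generate crawls/crawldb crawls/segments -topN 1000
  # In practice, pick up the newest segment, e.g.:
  #   segment=$(ls -d crawls/segments/* | tail -1)
  segment=crawls/segments/YYYYMMDDHHMMSS   # placeholder segment name
  run fetch "$segment"
  run parse "$segment"
  run updatedb crawls/crawldb "$segment"
done

# Note: -dir takes the parent segments directory, not an individual segment.
run invertlinks crawls/linkdb -dir crawls/segments
run index crawls/indexes crawls/crawldb crawls/linkdb crawls/segments/*
```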
