Hi,

On Thu, Feb 19, 2009 at 13:28, Bartek <[email protected]> wrote:
> Hello,
>
> I started to crawl a huge number of websites (dmoz with no limits in
> crawl-urlfilter.txt) with -depth 10 and -topN 1 mln
>
> My /tmp/hadoop-root/ is more than 18GB for now (map-reduce jobs)
>
>
> This fetching will not stop soon :) so I would like to convert the already-made
> segments (updatedb, invertlinks, index), but some parts are missing in them:
>
> [r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir
> crawls/segments/20090216142840/
>


If you use the -dir option, you pass the segments directory, not individual
segments, e.g.:

bin/nutch invertlinks crawls/linkdb -dir crawls/segments

which will read every directory under segments.

To pass individual segments, skip the -dir option:

bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
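On the question below about running small -topN rounds: yes, that is the usual pattern. A rough sketch of such a loop is below (not an official Nutch script; the crawls/ path, ROUNDS count, and -topN value are placeholders for your own setup, and the DRY_RUN guard just echoes the commands so you can inspect them first):

```shell
#!/bin/sh
# Rough sketch of an incremental crawl loop (hypothetical paths/values;
# adjust CRAWL, ROUNDS and -topN to your own setup).
NUTCH=bin/nutch
CRAWL=crawls
ROUNDS=3
DRY_RUN=echo   # remove this to actually run the commands

i=1
while [ "$i" -le "$ROUNDS" ]; do
  # Generate a small fetchlist, fetch it, then fold the results back in,
  # so crawldb/linkdb/index stay current instead of waiting weeks.
  $DRY_RUN $NUTCH generate $CRAWL/crawldb $CRAWL/segments -topN 1000
  # Pick the newest segment directory that generate just created.
  segment=$(ls -d "$CRAWL"/segments/* 2>/dev/null | tail -1)
  $DRY_RUN $NUTCH fetch "$segment"
  $DRY_RUN $NUTCH updatedb $CRAWL/crawldb "$segment"
  $DRY_RUN $NUTCH invertlinks $CRAWL/linkdb "$segment"
  $DRY_RUN $NUTCH index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb "$segment"
  i=$((i + 1))
done
```

This way each segment is complete (fetched, parsed, updated, inverted) before the next round starts, so you never end up with segments missing parse_data.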
>
> LinkDb: adding segment:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate
>
> ...
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data
>
> etc.
>
> When manually trying to bin/nutch parse the segments, it says they are already parsed.
>
>
> So my question is: how do I design the whole process of crawling a large
> number of websites without limiting them to specific domains (like a regular
> search engine, e.g. Google)?
>
> Should I make loops of small amounts of links? Like -topN 1000 and then
> updatedb, invertlinks, index?
>
>
> For now I can start crawling, but any data will only appear in weeks.
>
> I found that in 1.0 (already done) you are introducing live indexing in
> Nutch. Are there any docs that I can use?
>
> Regards,
> Bartosz Gadzimski
>
>
>
>



-- 
Doğacan Güney
