My segment merger is not functioning properly. I am unable to figure
out the problem.

These are the commands I am using.

bin/nutch inject crawl/crawldb seedurls

Then, in a loop that runs 10 times:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment -threads 50
  bin/nutch updatedb crawl/crawldb $segment
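
For reference, the loop can be written out roughly as below. This is only a sketch of the steps listed above: the function name `crawl_rounds` and the `NUTCH` variable are hypothetical names introduced here for illustration, and stopping at the first failing step is an added choice, not part of my original commands.

```shell
# Hypothetical wrapper around the generate/fetch/updatedb steps above.
# NUTCH defaults to the bin/nutch launcher used in the other commands.
NUTCH=${NUTCH:-bin/nutch}

crawl_rounds() {
  rounds=$1
  i=1
  while [ "$i" -le "$rounds" ]; do
    "$NUTCH" generate crawl/crawldb crawl/segments -topN 1000 -adddays 5 || return 1
    # Pick the newest segment (segment names sort by timestamp).
    segment=$(ls -d crawl/segments/* | tail -1)
    "$NUTCH" fetch "$segment" -threads 50 || return 1
    "$NUTCH" updatedb crawl/crawldb "$segment" || return 1
    i=$((i + 1))
  done
}
```

Running `crawl_rounds 10` would correspond to the 10 iterations described above. Returning early on failure is meant to avoid carrying a half-written segment into the merge.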

After the loop:

bin/nutch mergesegs crawl/merged_segments crawl/segments/*
rm -rf crawl/segments/*
mv --verbose crawl/merged_segments/* crawl/segments
rm -rf crawl/merged_segments

The merge and move steps print this output:

Merging 10 segments to crawl/MERGEDsegments/20050529095045
SegmentMerger:   adding crawl/segments/20050528144604
SegmentMerger:   adding crawl/segments/20050528144619
SegmentMerger:   adding crawl/segments/20050528145426
SegmentMerger:   adding crawl/segments/20050528151323
SegmentMerger:   adding crawl/segments/20050528164032
SegmentMerger:   adding crawl/segments/20050528170544
SegmentMerger:   adding crawl/segments/20050528192341
SegmentMerger:   adding crawl/segments/20050528203512
SegmentMerger:   adding crawl/segments/20050528210029
SegmentMerger:   adding crawl/segments/20050529055733
SegmentMerger: using segment data from: crawl_generate
`crawl/MERGEDsegments/20050529095045' -> `crawl/segments/20050529095045'

As the output shows, the merge used only crawl_generate. Other
directories, such as parse_data and crawl_fetch, were not used. Why?
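
One thing that can be checked before merging is whether every segment actually contains all of these directories. The sketch below is an assumption-laden helper, not a Nutch tool: the function name `check_segments` is hypothetical, and the list of part names is what I understand to be the usual segment layout. If some segment lacks parse_data (for example after a fetch that did not finish), that might be related to the merge output above.

```shell
# Hypothetical helper: print every expected part that is missing from a segment.
# The part names are an assumption based on the usual Nutch segment layout.
check_segments() {
  segments_dir=$1
  for seg in "$segments_dir"/*; do
    # Skip anything that is not a directory (including an unmatched glob).
    [ -d "$seg" ] || continue
    for part in content crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
      [ -d "$seg/$part" ] || echo "$seg: missing $part"
    done
  done
}
```

Calling `check_segments crawl/segments` before `mergesegs` prints one line per missing directory and nothing at all if every segment is complete.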

As a result, when I invert the links with this command:

bin/nutch invertlinks crawl/linkdb crawl/segments/*

I get this error:

LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20050529095045
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /opt/nutch-0.9infy00/crawl/segments/20050529095045/parse_data
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)

Please help.
