I am currently using the latest nightly nutch-0.8-dev build, and
I am really confused about how to proceed after I have done two different "whole web" incremental crawls.

To me, the tutorial is not clear on how to merge the results of the two crawls so that I can
run a search on them.

Could someone please give me a hint on what the right procedure is? Here is what I am doing:

1. create an initial urls file  /tmp/dmoz/urls.txt
2. hadoop dfs -put /tmp/urls/ url
3. nutch inject test/crawldb dmoz
4. nutch generate test/crawldb test/segments
5. nutch fetch test/segments/20060522144050
6. nutch updatedb test/crawldb   test/segments/20060522144050
7. nutch invertlinks linkdb test/segments/20060522144050
8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050

..and now I am able to search...
Now I run

9. nutch generate test/crawldb test/segments -topN 1000

and I end up with a new segment:   test/segments/20060522151957

10. nutch fetch test/segments/20060522151957
11. nutch updatedb test/crawldb test/segments/20060522151957
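
The fetch cycle in steps 10-11 (and 5-6 before it) can be sketched as a small shell helper. This is only a sketch under some assumptions: `nutch` is on the PATH, and the names `fetch_round` and `RUN` are illustrative, not part of Nutch. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: prints the per-segment commands from steps 10-11.
# RUN and fetch_round are illustrative names, not part of Nutch.
RUN=${RUN-echo}   # default is a dry run; export RUN="" to really execute

fetch_round() {
    crawldb=$1    # e.g. test/crawldb
    segment=$2    # e.g. test/segments/20060522151957
    # Fetch the segment, then fold its results back into the crawldb.
    $RUN nutch fetch "$segment" &&
    $RUN nutch updatedb "$crawldb" "$segment"
}

fetch_round test/crawldb test/segments/20060522151957
```

The segment name still has to be taken from what `nutch generate` created, since it is a timestamp and is not known in advance.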


From this point on I cannot make much progress.

A) I have tried to merge the two segments into a new one, with the idea of rerunning
invertlinks and index on it, but:

  nutch mergesegs test/segments -dir test/segments

  Whatever I specify as outputdir or outputsegment, I get errors.

B) I have also tried to run invertlinks on all of test/segments, with the goal of running
the nutch index command to produce a second indexes directory, let's say test/indexes1,
and finally running the index merge into index2:

  nutch invertlinks  test/linkdb  -dir test/segments

  This has created a new linkdb directory, *NOT* under test as specified, but as
<user>/linkdb-1108390519

  nutch index  test/indexes1 test/crawldb linkdb test/segments/20060522144050
  nutch merge index2 test/indexes test/indexes1

  Now I am not sure what to do. If I remove test/indexes and then rename test/index2 to
test/indexes, I am no longer able to search.


-Corrado


