Hi all, I'd appreciate your help with this question. I am using Nutch/Hadoop 0.8 (of 3/31/06) with DFS. I want to merge multiple crawls and search the combined content.
For example, I'd like to be able to:
- crawl 1 million URLs into a directory crawlA (with subdirectories segments, crawldb, linkdb, indexes, index)
- similarly, crawl a different 1 million URLs into a directory crawlB
- then combine the contents or the indexes and be able to search all 2 million URLs

I searched this list and found similar questions, but none of the answers worked for me, as some of them were specific to pre-0.8 Nutch.

I have already tried several things. I made a new directory crawl-all (with empty subdirectories segments, crawldb, linkdb), then copied crawlA/segments/<timestampA> and crawlB/segments/<timestampB> into crawl-all/segments, and then issued the command

bin/nutch index crawl-all/indexes crawl-all/linkdb crawl-all/crawldb crawl-all/segments/<timestampA> crawl-all/segments/<timestampB>

All I got was an almost empty crawl-all/indexes directory (about 100 bytes in all).

I also tried indexing each segment separately into one common indexes directory (crawl-all/indexes), but the second time I issued the index command I got an error saying the directory (crawl-all/indexes) already exists.

I am sure someone must have been able to merge the results of multiple crawls using 0.8. I'd appreciate your help; please provide details.

Thanks,
Carl
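To be concrete, here is roughly the sequence I ran. The DFS copy syntax below is from memory (I assume the hadoop dfs shell is the right way to copy on DFS), and <timestampA>/<timestampB> stand in for the actual segment directory names:

```
# set up the combined crawl layout
bin/hadoop dfs -mkdir crawl-all
bin/hadoop dfs -mkdir crawl-all/segments crawl-all/crawldb crawl-all/linkdb

# copy each crawl's segment into the combined segments directory
bin/hadoop dfs -cp crawlA/segments/<timestampA> crawl-all/segments
bin/hadoop dfs -cp crawlB/segments/<timestampB> crawl-all/segments

# index both segments in one pass -- this produced a near-empty
# crawl-all/indexes (about 100 bytes in all)
bin/nutch index crawl-all/indexes crawl-all/linkdb crawl-all/crawldb \
    crawl-all/segments/<timestampA> crawl-all/segments/<timestampB>
```

Note that crawl-all/crawldb and crawl-all/linkdb are empty here; I don't know whether the indexer needs populated dbs, which may be part of my problem.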
