I have a similar question, but I am using nutch 0.7.1 (non-mapred). Any suggestions are very much appreciated!
I want to have a "Live" directory that contains all the indexes that are ready to be searched. The first index I want to add to the "Live" directory comes from a crawl with 10 rounds of fetching, whose db and segments are stored in the following directories:

/crawlA/db/
/crawlA/segments/

I can merge all of the segments in the segments directory (using bin/nutch mergesegs), which results in the following (11th) segment directory:

/crawlA/segments/20051219000754/

I can then index the 11th (i.e. merged) segment. However, I have the following questions about which files and directories should be moved to the "Live" directory:

1. If I copy /crawlA/db/ to /Live/db/ and copy /crawlA/segments/20051219000754/ to /Live/segments/20051219000754/, then I can start Tomcat from /Live/ and I'm able to search the index fine. However, if I now have a crawlB directory, I can't copy its db to the "Live" directory because there is already a db directory there. What are the correct files and directories to copy from each crawl into the "Live" directory?

2. Am I even taking the correct approach in merging the 10 segments in the crawlA/segments/ directory before I index, or should I index each segment first and then merge the 10 indexes? If I were to take the latter approach, which files from the /crawlA/ directory would I need to move to the "Live" directory?

Thanks ahead of time for any helpful suggestions,
Bryan

On 11/21/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Ben Halsted wrote:
> > I was wondering what the required file structure is for the web gui to work
> > properly.
> >
> > Are all of these required?
> > /db/crawldb
> > /db/index
> > /db/indexes
> > /db/segments
> > /db/linkdb
>
> The indexes directory is not used when a merged index is present.
>
> The crawldb and segments/*/crawl_parse directories are not used by the
> web ui.
>
> > Also -- What is the proper way to merge segments and indexes? Can I simply
> > move segments all into one directory then re-index it, or is there a better
> > way?
>
> You should update the linkdb so that it contains links from all
> segments. Then you can use the dedup and merge commands to create a new
> index. Ideally you should also re-index after updating the linkdb, but
> this is not required.
>
> Doug
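
For reference, the copy step described in question 1 can be sketched roughly as below. This is only an illustration of the layout in the question, not a recommended Nutch procedure: the function name stage_crawl is hypothetical, and the only assumption about the crawl directory is that it contains db/ and segments/<name>/ as described above. The second call fails in the same way the question describes for crawlB, because a db directory is already present in the live directory.

```python
import shutil
from pathlib import Path


def stage_crawl(crawl_dir, live_dir, segment_name):
    """Copy one crawl's db and one (merged) segment into the live directory.

    Hypothetical helper for illustration only. Raises FileExistsError if
    live_dir/db or the target segment directory already exists.
    """
    crawl = Path(crawl_dir)
    live = Path(live_dir)

    # copytree refuses to overwrite an existing destination by default,
    # which mirrors the crawlB conflict described in question 1.
    shutil.copytree(crawl / "db", live / "db")
    shutil.copytree(crawl / "segments" / segment_name,
                    live / "segments" / segment_name)
    return live
```

With the paths from the question, the first call would be stage_crawl("/crawlA", "/Live", "20051219000754"); repeating the call for a second crawl raises FileExistsError on /Live/db, which is exactly the situation question 1 asks how to avoid.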
