Stefan, I am using Windows XP Pro. Can you let me know how I can achieve the same in a Windows environment? What steps do I need to follow so that I neither overwrite nor merge two web dbs together, but instead have the web db updated as and when new lists of URLs are added?
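For example, since bin/nutch is a plain Unix shell script, I assume I would have to run your loop below under something like Cygwin on XP. Would saving it to a file and running it from the Cygwin bash prompt be the right approach? (The file name crawl-loop.sh and the install path are just my guesses:)

    # assumption: Cygwin is installed and nutch is unpacked in C:\nutch-0.8-dev
    cd /cygdrive/c/nutch-0.8-dev
    # run the generate/fetch/updatedb loop quoted below, saved to a file
    sh crawl-loop.sh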
On 12/22/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> In general I suggest using a shell script and running the commands
> manually instead of using the crawl command, maybe something like:
>
> NUTCH_HOME=$HOME/nutch-0.8-dev
>
> while [ 1 ]   # or maybe just 10 rounds
> do
>   DATE=$(date +%d%B%Y_%H%M%S)
>
>   $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 5000000
>   # grab the full path of the newest segment just generated
>   s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`
>   $NUTCH_HOME/bin/nutch fetch $s
>   $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s
>
>   # only when indexing:
>   # $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments
>
>   # what to index -- maybe the merged segment from the 10 rounds
>   # (cut -c 24-38 keeps only the segment name, not the full path):
>   # s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`
>   # $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s
> done
>
> This prevents you from having to merge crawl db's.
> Then you only need the merged segment, the linkdb and the index built
> from the merged segment.
> The 10 segments used to build the merged segment can be removed.
>
> Hope this helps. You may only need to change the script so it loops
> just 10 times to create your 10 segments; note that the merging
> command is also not in the script.
> Stefan
>
> On 21.12.2005 at 18:28, Bryan Woliner wrote:
>
> > I am using nutch 0.7.1 (non-mapred) and am a little confused about
> > how to move the contents of several "test" crawls into a single
> > "live" directory. Any suggestions are very much appreciated!
> >
> > I want to have a "Live" directory that contains all the indexes
> > that are ready to be searched.
> >
> > The first index I want to add to the "Live" directory comes from a
> > crawl with 10 rounds of fetching, whose db and segments are stored
> > in the following directories:
> >
> > /crawlA/db/
> > /crawlA/segments/
> >
> > I can merge all of the segments in the segments directory (using
> > bin/nutch mergesegs), which results in the following (11th) segment
> > directory:
> >
> > /crawlA/segments/20051219000754/
> >
> > I can then index this 11th (i.e. merged) segment.
> >
> > However, I have the following questions about which files and
> > directories should be moved to the "Live" directory:
> >
> > 1. If I copy /crawlA/db/ to /Live/db/ and copy
> > /crawlA/segments/20051219000754/ to /Live/segments/20051219000754/,
> > then I can start tomcat from /Live/ and I'm able to search the
> > index fine. However, I'm not sure if that can be duplicated for my
> > crawlB directory. I can't copy /crawlB/db/ to the "Live" directory
> > because there is already a db directory there. What are the correct
> > files and directories to copy from each crawl into the "Live"
> > directory?
> >
> > 2. On a side note: am I even taking the correct approach in merging
> > the 10 segments in the crawlA/segments/ directory before I index,
> > or should I index each segment first and then merge the 10 indexes?
> > If I were to take the latter approach (merging indexes instead of
> > segments), which files from the /crawlA/ directory would I need to
> > move to the "Live" directory?
> >
> > Thanks ahead of time for any helpful suggestions,
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
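P.S. For reference, here is my current understanding of the merge-and-index step that is not in the script above. The invertlinks and index calls just mirror Stefan's 0.8-dev examples; the mergesegs arguments and the "merged" directory name are only my guesses, so please correct me if the syntax is different:

    NUTCH_HOME=$HOME/nutch-0.8-dev
    DATE=$(date +%d%B%Y_%H%M%S)

    # merge the segments from the 10 rounds into a single new segment
    # (argument order is my assumption -- run bin/nutch mergesegs to see the real usage)
    $NUTCH_HOME/bin/nutch mergesegs /user/nutchUser/merged/$DATE -dir /user/nutchUser/segments

    # build the linkdb and the index from the merged segment only
    $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/merged
    $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$DATE /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/merged/$DATE

    # after that the 10 per-round segments can be removed, as Stefan notes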
