Based on the org.apache.nutch.Crawl class, I've created a ReCrawl class that can be run much like a Nutch intranet crawl (or like the recrawl scripts). The ReCrawl class works in the manner of "SOLUTION 'A'" below, except that it doesn't reload the web application for you, since that wasn't something I needed for my uses. The usage is something like:

bin/nutch recrawl -dir existingCrawlDir -depth i -add addDays -topN topN
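For example, a re-crawl of an existing crawl directory might be invoked like this (the directory name and the depth/adddays/topN values are only illustrative, not defaults of the class):

  bin/nutch recrawl -dir crawl -depth 3 -add 31 -topN 1000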
Is there a reason this functionality wasn't previously built into Nutch? Once I test this a bit more, would the developers like a patch with my additions?

Jeff

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 20, 2007 9:54 AM
To: [email protected]
Subject: Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

We can do two things to solve this problem.

SOLUTION 'A'

1. Once the 'depth' loop is complete, merge the segments in 'crawl/segments/'. ('crawl/segments/' will have one merged segment from the past plus all the segments generated in the depth loop, one for each iteration of the loop.) They are merged into a single segment in MERGEDsegments with the following command:

   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

2. Now replace 'crawl/segments' with 'crawl/MERGEDsegments':

   rm -rf crawl/segments
   mv $MVARGS crawl/MERGEDsegments crawl/segments

3. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment

4. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

5. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

6. $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

7. Delete crawl/NEWindexes. We are done!

I think this is very similar to Jeff's solution. Alexis argued:

> I am losing index of previous crawl.

The thing to notice here is that we can safely delete crawl/NEWindexes (or OLDindexes in Jeff's case) because in step 4 the indexes are generated from a merged segment into which the old segments have also been merged. So we are not losing anything.

SOLUTION 'B'

1. Generate the new segments in another directory, say, NEWsegments:

   $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/NEWsegments -topN $topN -adddays $adddays

2. After the depth loop is over, merge the new segments into crawl/segments. ('crawl/segments' may already hold multiple merged segments, one for each past crawl, if this is not the first crawl.)

   bin/nutch mergesegs crawl/segments crawl/NEWsegments/*

   Now 'crawl/segments' contains multiple merged segments (one for each crawl in the past) plus another merged segment from the current re-crawl, and 'crawl/NEWsegments' is no longer needed, so we can delete it.

3. Store the latest merged segment in a variable:

   segment=`ls -d crawl/segments/* | tail -1`

   (From now on, we won't run the remaining operations over all the segments as we did in solution 'A'; we run them only on the merged segment we have just created.)

4. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment

5. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb $segment

6. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

   (So with steps 3-6, we generated indexes for the merged segment from this crawl only.)

7. Let's assume the past indexes were saved as 'crawl/indexes1', 'crawl/indexes2', etc. Now all of them can be merged:

   $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes crawl/indexes1 crawl/indexes2

I think this is what Tomislav must have done, whereas what Alexis must have done is a mix of solution 'A' and solution 'B'. For example, if you generate 'crawl/NEWindexes' from all the merged segments (new as well as old) and merge this NEWindexes (which is not strictly new) with the old indexes, you will probably get that error.
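Put end to end, solution 'A' amounts to a short shell script. The sketch below is only an illustration under the directory layout used above; the generate/fetch/updatedb depth loop is borrowed from the wiki recrawl scripts, and the depth/topN/adddays values are placeholders, not defaults:

  #!/bin/sh
  # Sketch of SOLUTION 'A' end to end; values below are examples only.
  NUTCH_HOME=${NUTCH_HOME:-.}
  depth=3
  topN=1000
  adddays=31

  # depth loop, as in the standard recrawl scripts
  for i in `seq 1 $depth`; do
    $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
    segment=`ls -d crawl/segments/* | tail -1`
    $NUTCH_HOME/bin/nutch fetch $segment
    $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
  done

  # steps 1-2: collapse the old merged segment and the new ones into a single segment
  $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
  rm -rf crawl/segments
  mv crawl/MERGEDsegments crawl/segments

  # steps 3-6: rebuild linkdb and indexes from the merged segment
  $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
  $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
  $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
  $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

  # step 7: NEWindexes has been merged into crawl/index, so drop it
  rm -rf crawl/NEWindexes

Solution 'B' differs only in where the new segments land: the loop generates into crawl/NEWsegments, mergesegs adds one new merged segment to crawl/segments, and the final merge combines crawl/NEWindexes with the saved per-crawl indexes (crawl/indexes1, crawl/indexes2, ...) into crawl/index.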
To summarize: in solution 'A' we generate 'crawl/NEWindexes' from all the merged segments (new as well as old), so it is not strictly new, and while merging we merge only this, because it has everything. In solution 'B' we generate 'crawl/NEWindexes' from the most recent merged segment only, so it is strictly new, and while merging we merge NEWindexes with the old indexes into 'crawl/index'.

Regards,
Susam Pal
http://susam.in/

On 9/20/07, Alexis Votta <[EMAIL PROTECTED]> wrote:
> Hi Tomislav and Nutch users
>
> I could not solve the problem with your instructions.
>
> I crawled two times. On the re-crawl it generated crawl/NEWindexes;
> crawl/indexes was generated in the first crawl.
>
> I merged ==> bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/
>
> Now search.jsp is showing an error:
>
> type Exception report
>
> message
>
> description The server encountered an internal error () that prevented
> it from fulfilling this request.
>
> exception
>
> org.apache.jasper.JasperException: java.lang.RuntimeException:
> java.lang.NullPointerException
>   org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:532)
>   org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:426)
>   org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
>   org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>
> root cause
>
> java.lang.RuntimeException: java.lang.NullPointerException
>   org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:342)
>   org.apache.jsp.search_jsp._jspService(search_jsp.java:247)
>   org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>   org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:384)
>   org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
>   org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>
> root cause
>
> java.lang.NullPointerException
>   org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
>   org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)
>
> Is there any Crawl guru who can help?
>
> On 9/20/07, Tomislav Poljak <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I had the same problem using the re-crawl scripts from the wiki. They all work
> > fine with Nutch versions up to and including 0.9, but when using
> > nutch-1.0-dev (from trunk) they break at the merge of indexes. The reason is that
> > the merge in nutch-0.9 (from the re-crawl scripts):
> >
> > bin/nutch merge crawl/indexes crawl/NEWindexes
> >
> > merged the old indexes from crawl/indexes with the new indexes
> > from crawl/NEWindexes and stored the result in crawl/indexes. But with
> > nutch-1.0-dev (from trunk), merge requires an empty (new) output folder.
> >
> > A solution that works (I have tried it) is the following:
> >
> > bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes
> >
> > where crawl/index is the new (output) folder, crawl/indexes is the old indexes
> > and crawl/NEWindexes is the new indexes. It is important to know that
> > you can do this with as many indexes as you want to merge (as many
> > re-crawls); you only have to run:
> >
> > bin/nutch merge crawl/index crawl/indexes1 crawl/indexes2 ...
> >
> > but crawl/index must not exist (delete it or back it up).
> >
> > The Nutch search web application will use the merged index from crawl/index;
> > this is from my web application log:
> >
> > 2007-09-09 20:30:58,949 INFO searcher.NutchBean - creating new bean
> > 2007-09-09 20:30:59,128 INFO searcher.NutchBean - opening merged index
> > in /home/nutch/test/trunk/crawl/index
> >
> > Hope this will help,
> >
> > Tomislav
> >
> > On Thu, 2007-09-20 at 14:54 +0800, Lyndon Maydwell wrote:
> > > /nutch mergesegs $merged_segment -dir $segments
> > > if [ $? -ne 0 ]
> > >
> >
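As a sketch of Tomislav's fix, a re-crawl script targeting trunk can guard the final index merge so the output folder never pre-exists (the paths follow the layout used in this thread; the backup naming is just an example):

  # on trunk, crawl/index must not exist before the merge, so back it up first
  if [ -d crawl/index ]; then
    mv crawl/index crawl/index.bak.`date +%Y%m%d%H%M%S`
  fi
  bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes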
