Based on the org.apache.nutch.Crawl class, I've created a ReCrawl class
that can be run similarly to how a nutch intranet crawl is run (or how
the recrawl scripts work).  I've written the ReCrawl class to function
in the manner of "SOLUTION 'A'" below.  However, mine doesn't reload
the web application for you since that wasn't something I needed to
include for my uses.  The usage is something like:

bin/nutch recrawl -dir existingCrawlDir -depth i -add addDays -topN topN
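
For example, to re-crawl an existing crawl directory for three rounds
with addDays set to 7 (all values here are purely illustrative):

bin/nutch recrawl -dir crawl -depth 3 -add 7 -topN 1000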

Is there a reason this functionality wasn't previously built into
Nutch?  Once I test this a bit more, would the developers like a patch
with my additions?

Jeff


-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 20, 2007 9:54 AM
To: [email protected]
Subject: Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

We can do two things to solve this problem.

SOLUTION 'A'

1. Once the 'depth' loop is complete, merge the segments in
'crawl/segments/'. (At this point 'crawl/segments/' holds one merged
segment from the past plus all the segments generated in the depth
loop, one for each iteration.) They are merged into a single segment
in 'crawl/MERGEDsegments' with the following command:

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

2. Now replace 'crawl/segments' with 'crawl/MERGEDsegments'.

rm -rf crawl/segments
mv $MVARGS crawl/MERGEDsegments crawl/segments

3. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
4. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
5. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
6. $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
7. Delete crawl/NEWindexes. We are done!
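
Taken together, solution 'A' amounts to the shell fragment below (a
sketch only; it assumes $NUTCH_HOME, $MVARGS and an existing 'crawl'
directory as in the wiki recrawl scripts, and runs after the usual
generate/fetch/updatedb depth loop):

# Steps 1-2: collapse everything into one merged segment.
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments
mv $MVARGS crawl/MERGEDsegments crawl/segments

# Steps 3-5: invert links, index and dedup from the merged segment.
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

# Steps 6-7: merge into the final index, then clean up.
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
rm -rf crawl/NEWindexes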

I think this is very similar to Jeff's solution. Alexis argued that:

> I am losing index of previous crawl.

The thing to notice here is that we can safely delete
crawl/NEWindexes (or OLDindexes in Jeff's case) because in step 4
the indexes are generated from a merged segment into which the old
segments have also been merged. So we are not losing anything.

SOLUTION 'B'

1. Generate the new segments in another directory, say, NEWsegments.

$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/NEWsegments $topN -adddays $adddays

2. After the depth loop is over, merge the new segments into
'crawl/segments'. ('crawl/segments' may already hold multiple merged
segments, one for each past crawl, if this is not the first crawl.)

bin/nutch mergesegs crawl/segments crawl/NEWsegments/*

So now 'crawl/segments' contains multiple merged segments (one for
each crawl in the past) plus another merged segment from the current
re-crawl. We no longer need 'crawl/NEWsegments', so we can delete it.

3. Store the latest merged segment in a variable.

segment=`ls -d crawl/segments/* | tail -1`

(Segment directories are named by timestamp, so the last entry in a
sorted listing is the newest. From now on, we won't run the remaining
operations over all the segments as we did in solution 'A'; we will
run them only on the merged segment we have just created.)

4. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment
5. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb $segment
6. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

(So with steps 3-6 we have generated indexes only for the merged
segment produced by this crawl.)

7. Let's assume the past indexes were saved as 'crawl/indexes1',
'crawl/indexes2', etc. Now all of them can be merged as follows:

$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes crawl/indexes1 crawl/indexes2
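
For reference, here is all of solution 'B' as one sketch (same
assumptions as the fragment above; the fetch and updatedb commands
inside the depth loop are elided, and 'indexes1'/'indexes2' stand for
however the past indexes were saved):

# Step 1 (inside the depth loop): generate into a separate directory.
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/NEWsegments $topN -adddays $adddays
# ... fetch the generated segment and run updatedb, once per iteration ...

# Step 2: merge only the new segments, then drop them.
$NUTCH_HOME/bin/nutch mergesegs crawl/segments crawl/NEWsegments/*
rm -rf crawl/NEWsegments

# Steps 3-6: operate on the newest merged segment only.
segment=`ls -d crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb $segment
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

# Step 7: merge the strictly-new indexes with all past indexes.
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes crawl/indexes1 crawl/indexes2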

I think this is what Tomislav must have done whereas what Alexis must
have done is a mix of solution 'A' and solution 'B'.

For example, if you generate 'crawl/NEWindexes' from all the merged
segments (new as well as old) and then merge this NEWindexes (which
is not strictly new) with the old indexes, you will probably get that
error.
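
Spelled out, that failing mix would look something like this (a
reconstruction for illustration, not necessarily Alexis's exact
commands):

# NEWindexes built from ALL merged segments, old ones included:
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
# ... then ALSO merged with the old indexes, so the old documents
# end up in crawl/index twice:
$NUTCH_HOME/bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes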

To summarize: in solution 'A' we generate 'crawl/NEWindexes' from all
the merged segments (new as well as old), so it is not strictly new.
When merging, we merge only this index, because it already contains
everything.

In solution 'B' we generate 'crawl/NEWindexes' from the most recent
merged segment only, so it is strictly new. When merging, we
therefore merge NEWindexes with the old indexes into 'crawl/index'.

Regards,
Susam Pal
http://susam.in/

On 9/20/07, Alexis Votta <[EMAIL PROTECTED]> wrote:
> Hi Tomislav and Nutch users
>
> I could not solve the problem with your instructions.
>
> I crawled two times. The re-crawl generated crawl/NEWindexes;
> crawl/indexes was generated in the first crawl.
>
> I merged ==> bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/
>
> Now search.jsp is showing an error:
> type Exception report
>
> message
>
> description The server encountered an internal error () that prevented
> it from fulfilling this request.
>
> exception
>
> org.apache.jasper.JasperException: java.lang.RuntimeException:
> java.lang.NullPointerException
>         org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:532)
>         org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:426)
>         org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
>         org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
>         javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>
> root cause
>
> java.lang.RuntimeException: java.lang.NullPointerException
>         org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
>         org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:342)
>         org.apache.jsp.search_jsp._jspService(search_jsp.java:247)
>         org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>         javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>         org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:384)
>         org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
>         org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
>         javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>
> root cause
>
> java.lang.NullPointerException
>         org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
>         org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)
>
> Is there any Crawl guru who can help?
>
> On 9/20/07, Tomislav Poljak <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I had the same problem using the re-crawl scripts from the wiki. They
> > all work fine with nutch versions up to 0.9 (0.9 included), but when
> > using nutch-1.0-dev (from trunk) they break at the merge of indexes.
> > The reason is that the merge in nutch-0.9 (from the re-crawl scripts):
> >
> > bin/nutch merge crawl/indexes crawl/NEWindexes
> >
> > merged the old indexes from crawl/indexes with the new indexes
> > from crawl/NEWindexes and stored the result in crawl/indexes. But with
> > nutch-1.0-dev (from trunk), merge requires an empty (new) output folder.
> >
> > The solution that works (I have tried it) is to do the following:
> >
> > bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes
> >
> > where crawl/index is the new (output) folder, crawl/indexes is the old
> > indexes and crawl/NEWindexes is the new indexes. It is important to
> > know that you can do this with as many indexes as you want to merge
> > (as many re-crawls); you only have to do:
> >
> > bin/nutch merge crawl/index crawl/indexes1 crawl/indexes2 ...
> >
> > but crawl/index must not exist (delete it or back it up).
> >
> > The Nutch search web application will use the merged index from
> > crawl/index; this is from my web application log:
> >
> > 2007-09-09 20:30:58,949 INFO  searcher.NutchBean - creating new bean
> > 2007-09-09 20:30:59,128 INFO  searcher.NutchBean - opening merged index
> > in /home/nutch/test/trunk/crawl/index
> >
> >
> > Hope this will help,
> >
> > Tomislav
> >
> >
> >
> > On Thu, 2007-09-20 at 14:54 +0800, Lyndon Maydwell wrote:
> > > /nutch mergesegs $merged_segment -dir $segments
> > > if [ $? -ne 0 ]
> >
> >
>
