I'm running incremental crawls against my site, which has more than
300,000 documents. About 200-400 new documents appear each day, and I
inject their URLs daily. One day's batch seems to be missing from my
search results. I tried reinjecting those URLs and recrawling them, but
I still can't pull up a result.

This is how I'm injecting and updating:

    bin/nutch inject db -urlfile $batch >> $INJECTLOG 2>&1
    bin/nutch generate db segments $batch >> $GENERATELOG 2>&1
    # ls -d prints "segments/<name>"; strip the prefix so the
    # "segments/$s1" paths below don't double it.
    s1=$( ls -trd segments/2* | tail -n1 )
    s1=${s1#segments/}
    bin/nutch fetch segments/$s1 >> $FETCHLOG 2>&1
    bin/nutch updatedb db segments/$s1 >> $UPDATELOG  2>&1
    bin/nutch index segments/$s1 >> $INDEXLOG  2>&1
    bin/nutch dedup segments dedup.tmp >> $DEDUPLOG  2>&1

I can find a URL from the day's URL file ($batch from above):

$ grep -l http://www.site.com/releases/2005/12/site322438.htm \
urls.done/2005-12-*
>    urls.done/2005-12-14.0105.01

Then I find the segment(s) it got to, using the date of the segment dir:

$ bin/nutch fetchlist -local -dumpurls \
    segments/20051214010517 \
    | grep site322438
> Recno 162: http://www.site.com/releases/2005/12/site322438.htm
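
The date-based lookup above can be sketched as a small helper. This is only a sketch: it assumes the batch file name carries the same date, hour, and minute as the segment directory, which happens to hold in my setup, and `segment_prefix` is a hypothetical name, not part of Nutch.

```shell
# Hypothetical helper: turn a batch file name such as
# urls.done/2005-12-14.0105.01 into the segment-name prefix 200512140105.
segment_prefix() {
    name=$(basename "$1")               # 2005-12-14.0105.01
    name=${name%.*}                     # drop trailing ".01" -> 2005-12-14.0105
    printf '%s\n' "$name" | tr -d '.-'  # remove dots and dashes
}

# List candidate segment directories for a batch (assumed layout):
# ls -d segments/"$(segment_prefix urls.done/2005-12-14.0105.01)"*
```

If several segments share the same minute, each candidate still has to be checked with fetchlist as above.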

So I look up the record with fetchlist:

$ bin/nutch fetchlist -local -recno 162 \
    segments/20051214010517
> version: 2
> fetch: true
> page: Version: 4
> URL: http://www.site.com/releases/2005/12/site322438.htm
> ID: 7ce9582e9b9d1b3306e3ee59f50fa2da
> Next fetch: Wed Dec 21 01:05:26 PST 2005
> Retries since fetch: 0
> Retry interval: 1 days
> Num outlinks: 0
> Score: 1.0
> NextScore: 1.0
> 
> anchors: 0


So presumably it's indexed, and I should be able to find it by searching
for the phrase "site322438" (the same way I can find other documents by
document number). But I get no results.
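
One more check that might narrow this down is whether the per-step logs from the script above mention the document at all. A minimal sketch, assuming those logs record the URLs they process (I have not confirmed the exact log format); `logs_mentioning` is a hypothetical helper name:

```shell
# Print the name of every given log file that mentions the pattern.
# grep -l lists matching files; -F treats the pattern as a fixed string.
logs_mentioning() {
    pattern=$1
    shift
    grep -lF -- "$pattern" "$@" 2>/dev/null
}

# Example, using the log variables from the script above:
# logs_mentioning site322438.htm "$FETCHLOG" "$UPDATELOG" "$INDEXLOG"
```

A log that is missing from the output would suggest which step never saw the URL.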

Is this because the segment it's in isn't getting looked at?
Is there a version 1 that's taking precedence? Is this the kind of
behavior I can correct by merging my indexes? Am I on the right track?

TIA

-- 
Jed Reynolds
System Administrator, PRWeb International, Inc. 360-312-0892

