I seem to be running into a roadblock with the resources I have available. The time it takes to split a segment into two segments using -slice goes through the roof once there are over 500k unfetched URLs.
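For reference, the slice step I'm describing looks roughly like this (paths are just examples from my layout; the input is the merged segment produced by the cycle described below):

  # slice size = half of TOTAL urls (563326 / 2)
  bin/nutch mergesegs crawl/sliced -dir crawl/merged -slice 281663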
I've been running generate/fetch with -topN 4000 and the time per cycle has been increasing incrementally, as expected, but there seems to be a tipping point. My thought is that since I am only doing the "mergesegs -slice" to generate indexes to be distributed, I really only need segments that contain fetched URLs and can ignore all of the unfetched URLs. Any ideas on how to do this? Or any tips on what I should be doing to reduce the clutter I am picking up naturally?

Here are the stats on what I have now. This run is currently sitting at over 15 hours; the previous iteration using -topN 4000 took only ~2 hours. I can reproduce this state pretty reliably.

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:   563326
retry 0:      561944
retry 1:      638
retry 2:      101
retry 3:      84
retry 4:      46
retry 5:      56
retry 6:      52
retry 7:      56
retry 8:      51
retry 9:      72
retry 10:     47
retry 11:     70
retry 12:     55
retry 13:     36
retry 14:     6
retry 15:     6
retry 16:     1
retry 17:     5
min score:    0.0
avg score:    0.018914262
max score:    86.0
status 1 (db_unfetched):    514967
status 2 (db_fetched):      35825
status 3 (db_gone):         5235
status 4 (db_redir_temp):   2548
status 5 (db_redir_perm):   3252
status 6 (db_notmodified):  1499
CrawlDb statistics: done

Once I've done the generate/fetch/updatedb/mergesegs, I do a mergesegs -slice (1/2 * total_urls). This is the step that is taking too long. I then index each of the two new segments individually and send them off to their individual search nodes.

Since these segments are only used for searching, can I generate segments that contain only fetched URLs to index for the searchers?

Thanks in advance for any insight!

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice.
              // Guaranteed to be random.
}
// xkcd.com
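P.S. In case the exact commands matter, each cycle before the slice step looks roughly like this (again, the paths are examples from my layout; the index step also assumes a linkdb I maintain separately with invertlinks):

  bin/nutch generate crawl/crawldb crawl/segments -topN 4000
  seg=`ls -d crawl/segments/2* | tail -1`    # the newly generated segment
  bin/nutch fetch $seg
  bin/nutch updatedb crawl/crawldb $seg
  bin/nutch mergesegs crawl/merged -dir crawl/segments

After the -slice step shown above splits the merged segment in two, each half is indexed on its own and shipped to its search node, along the lines of:

  bin/nutch index crawl/indexes/node1 crawl/crawldb crawl/linkdb crawl/sliced/<first_slice>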