I seem to be running into a roadblock with the resources I have available. The time it takes to split a segment into two segments using -slice goes through the roof once there are over 500k unfetched URLs.
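For reference, the slice step I'm describing looks roughly like this (paths are just examples from my layout; the input is the merged segment produced by the cycle described below):

  # slice size = half of TOTAL urls (563326 / 2)
  bin/nutch mergesegs crawl/sliced -dir crawl/merged -slice 281663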
I've been running generate/fetch with -topN 4000 and the time per cycle has been increasing incrementally, as expected, but there seems to be a tipping point. My thought is that since I am only doing the "mergesegs -slice" to generate indexes to be distributed, I really only need segments that contain fetched URLs and can ignore all of the unfetched URLs. Any ideas on how to do this? Or any tips on what I should be doing to reduce the clutter I am picking up naturally?

Here are the stats on what I have now. This run is currently sitting at over 15 hours; the previous iteration using -topN 4000 took only ~2 hours. I can reproduce this state pretty reliably.

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:   563326
retry 0:      561944
retry 1:      638
retry 2:      101
retry 3:      84
retry 4:      46
retry 5:      56
retry 6:      52
retry 7:      56
retry 8:      51
retry 9:      72
retry 10:     47
retry 11:     70
retry 12:     55
retry 13:     36
retry 14:     6
retry 15:     6
retry 16:     1
retry 17:     5
min score:    0.0
avg score:    0.018914262
max score:    86.0
status 1 (db_unfetched):    514967
status 2 (db_fetched):      35825
status 3 (db_gone):         5235
status 4 (db_redir_temp):   2548
status 5 (db_redir_perm):   3252
status 6 (db_notmodified):  1499
CrawlDb statistics: done

Once I've done the generate/fetch/updatedb/mergesegs, I do a mergesegs -slice (1/2 * total_urls). This is the step that is taking too long. I then index each of the two new segments individually and send them off to their individual search nodes.

Since these segments are only used for searching, can I generate segments that contain only fetched URLs to index for the searchers?

Thanks in advance for any insight!

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice.
              // Guaranteed to be random.
}
// xkcd.com
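P.S. In case the exact commands matter, each cycle before the slice step looks roughly like this (again, the paths are examples from my layout; the index step also assumes a linkdb I maintain separately with invertlinks):

  bin/nutch generate crawl/crawldb crawl/segments -topN 4000
  seg=`ls -d crawl/segments/2* | tail -1`    # the newly generated segment
  bin/nutch fetch $seg
  bin/nutch updatedb crawl/crawldb $seg
  bin/nutch mergesegs crawl/merged -dir crawl/segments

After the -slice step shown above splits the merged segment in two, each half is indexed on its own and shipped to its search node, along the lines of:

  bin/nutch index crawl/indexes/node1 crawl/crawldb crawl/linkdb crawl/sliced/<first_slice>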