Hi all-
The generate phase has always taken a long time for me, and I wanted to
report on it here. (Note: this is not the really bad problem I mentioned
earlier, where it ran an order of magnitude slower still. That problem
went away and I cannot reproduce it.)
I have a crawldb with 40 million entries, so I expect everything to
be slow, but generate is now the slowest part, taking up to 3 hours to
complete. I can run a Linux "sort -n" on a file with 40 million lines in about
20 minutes, and I believe that is basically what generate is doing
(selecting the top-scoring urls). I think we can even do better than
sort, which should be n log n: go through the list of urls one by one and
only store a url in the topN list when it ranks above the current cutoff.
If the topN list is kept as a heap, that costs about n log(topN), which is
near n log n when topN is near the database size and near n when topN is
small compared to it.
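Roughly what I have in mind, as a sketch in Python rather than anything
taken from the Nutch code. I have not checked how generate actually selects
urls, and top_n and the (score, url) pairs are made-up names for
illustration. It keeps a min-heap of at most topN entries, so the smallest
kept score is the cutoff and most urls are rejected with a single comparison:

```python
import heapq


def top_n(scored_urls, n):
    """Return the n highest-scoring (score, url) pairs, best first.

    Makes one pass over the input and keeps at most n entries in memory.
    """
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is always the lowest score we are keeping
    for score, url in scored_urls:
        if len(heap) < n:
            # Still filling the topN list, so keep everything.
            heapq.heappush(heap, (score, url))
        elif score > heap[0][0]:
            # Beats the current cutoff: drop the lowest entry, keep this one.
            heapq.heapreplace(heap, (score, url))
        # Otherwise the url is below the cutoff and is skipped.
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    urls = [(0.3, "a"), (0.9, "b"), (0.1, "c"), (0.7, "d")]
    print(top_n(urls, 2))  # [(0.9, 'b'), (0.7, 'd')]
```

Each heap operation is log(topN), and it only happens for urls that beat
the cutoff, so the whole thing is at worst n log(topN). Urls that tie with
the cutoff score are skipped here, which may or may not be what we want.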
Shouldn't generate be able to go faster than "sort -n"?
Am I missing something?
see you
-Jim