Hi all-

    The generate phase has always taken a long time for me, and I wanted to 
report on it here.  (Note: this is not the really bad problem I mentioned 
earlier, where it was running about an order of magnitude slower still.  That 
problem went away and I cannot reproduce it.)

    I have a crawldb with 40 million entries, so I expect everything to 
be slow, but generate is now the slowest part, taking up to 3 hours to 
complete.  I can run a Linux "sort -n" on a file with 40 million lines in 
about 20 minutes, and I believe that is basically what generate is doing 
(selecting the top-scoring urls).  In fact I think we can do better than 
sort, which should be n log n.  We could get effectively close to "n" by 
going through the list of urls one by one and only storing a url in the topN 
list when its score is above the current cutoff (this would be near n log n 
when topN is near the database size, and near n when topN is small compared 
to it).  Shouldn't generate be able to go faster than "sort -n"?
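
    To make the idea concrete, here is a rough Python sketch of the one-pass 
selection I have in mind.  It is not how generate is actually written; the 
function name top_n_by_score and the (score, url) tuples are made up for 
illustration.  It keeps a min-heap of at most topN entries, so the smallest 
kept score acts as the cutoff, and the cost should be roughly n log topN 
instead of n log n.

```python
import heapq


def top_n_by_score(entries, top_n):
    """Return the top_n (score, url) pairs, highest score first.

    Hypothetical helper, not part of any crawler API. `entries` is any
    iterable of (score, url) tuples and is read exactly once.
    """
    if top_n <= 0:
        return []
    heap = []  # min-heap: heap[0] holds the lowest score kept so far (the cutoff)
    for score, url in entries:
        if len(heap) < top_n:
            # Still filling the list: keep everything.
            heapq.heappush(heap, (score, url))
        elif score > heap[0][0]:
            # Beats the cutoff: drop the current minimum and keep this one.
            heapq.heapreplace(heap, (score, url))
        # Otherwise the entry is below the cutoff and costs one comparison.
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    sample = [(0.2, "http://a.example/"), (0.9, "http://b.example/"),
              (0.5, "http://c.example/"), (0.7, "http://d.example/")]
    for score, url in top_n_by_score(sample, 2):
        print(score, url)
```

    Most entries fall below the cutoff and are rejected after a single 
comparison, which is why this should behave close to linear when topN is 
much smaller than the database.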

    Am I missing something?

                        see you
                            -Jim
