I'm seeing very bad performance when I try to generate a segment from a CrawlDb that contains 1M URLs.
I have a cluster of 2 machines, with 200 map tasks and 5 reduce tasks. I set the map count to 200 because I was running into OutOfMemory errors otherwise. Correct me if I'm wrong, but the process runs in two steps:

1- a first job extracts all the URLs eligible for crawling, up to my topN limit;
2- a second job partitions the list by host and creates 200 output files (the same number as the map tasks).

It's the second step that takes so long: the job ran for more than 5 hours, which seems huge to me. What about you? Do you see similar performance?

One thing I did find out is that the second job creates 200 output files even when they are empty. For instance, my CrawlDb contains 1M URLs but only 5 distinct hosts, so the second job partitions the list into 5 output files containing the URLs and 195 empty ones. That hurts performance, because time is wasted copying empty output from one server to the other.

Don't you think we could find a better way to partition the URLs, either by avoiding the empty files or by spreading the URLs more evenly over the whole set of maps? Two rough sketches of what I mean are below.
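To make the host partitioning concrete, here is roughly how I picture it (my own sketch against the newer org.apache.hadoop.mapreduce API, not the actual Nutch code; the class name HostPartitioner is made up). Every URL of the same host hashes to the same reduce partition, so with only 5 distinct hosts at most 5 of the 200 partitions ever receive any data:

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, Writable> {

  @Override
  public int getPartition(Text urlKey, Writable value, int numPartitions) {
    String host;
    try {
      // All URLs sharing a host map to the same partition.
      host = new URL(urlKey.toString()).getHost();
    } catch (MalformedURLException e) {
      host = urlKey.toString(); // fall back to the raw key
    }
    // Mask off the sign bit so the partition index is non-negative.
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}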
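As for avoiding the empty files, one idea (untested, and assuming a Hadoop version that ships org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat) would be to defer output-file creation until a reducer actually writes its first record:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GenerateJobSetup {
  public static Job configure(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "generate-partition");
    // No part file is created for a reducer until it writes its
    // first record, so empty partitions leave nothing behind.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    return job;
  }
}

With something like that, only the 5 partitions that actually hold URLs would produce files, and there would be no empty output to ship between the servers.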
