I'm seeing very bad performance when I try to generate a segment from a
CrawlDb that contains 1M URLs.

I have a cluster of 2 machines, with 200 map tasks and 5 reduce tasks.

I set the number of maps to 200 because I ran into OutOfMemory errors
otherwise.
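
In JobConf terms (the old mapred API), my setup amounts to roughly the
following. This is only a sketch with a made-up class name, not actual
Nutch code:

    import org.apache.hadoop.mapred.JobConf;

    public class GenerateJobSetup {
        // Sketch only: the JobConf calls are real, but this class is
        // made up and is not the actual Generator code.
        public static void configure(JobConf job) {
            job.setNumMapTasks(200);   // raised to 200 to avoid OutOfMemory
            job.setNumReduceTasks(5);  // 5 reduces on the 2-machine cluster
        }
    }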

Correct me if I'm wrong, but the process runs in 2 steps:
1- a first job extracts all the URLs eligible for crawling, up to my
topN limit
2- a second job partitions them by host and creates 200 outputs (the
same number as maps); see the sketch right after this list
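
If I read it right, the heart of step 2 is hashing the host name modulo
the number of partitions. Here is a plain-Java sketch of the idea (my
own simplification with a made-up helper name; in the real job this
logic would sit inside a Hadoop Partitioner):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class HostPartitionSketch {
        // All URLs of a given host map to the same partition, which
        // keeps a host's URLs together in one fetch list.
        static int partitionFor(String url, int numPartitions) {
            String host;
            try {
                host = new URL(url).getHost();
            } catch (MalformedURLException e) {
                host = url; // fall back to the raw string
            }
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }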

It's actually the second step that takes a long time: the process took
more than 5 hours, which seems huge to me.
What about you? Do you see similar performance?

One thing I found out is that it creates 200 outputs even when an
output is empty.
For instance, my CrawlDb contains 1M URLs but for only 5 different
hosts. That means the second job partitions the list into 5 output
files containing the needed URLs and 195 empty output files. This hurts
performance, because time is wasted copying the outputs from one server
to the other.
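
To make it concrete, here is a tiny runnable sketch (host names made
up) showing that 5 hosts can fill at most 5 of the 200 partitions, so
at least 195 output files necessarily come out empty:

    public class EmptyPartitionsDemo {
        public static void main(String[] args) {
            String[] hosts = {"a.example.com", "b.example.com",
                              "c.example.com", "d.example.com",
                              "e.example.com"};
            int numPartitions = 200;
            for (String host : hosts) {
                int p = (host.hashCode() & Integer.MAX_VALUE)
                        % numPartitions;
                System.out.println(host + " -> partition " + p);
            }
        }
    }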

Don't you think we could find a better way to partition the URLs,
either by avoiding the empty files or by spreading the URLs more evenly
over the whole set of maps?
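
For example, one rough direction (a sketch only, not a patch) would be
to hash the full URL instead of only the host, which spreads the 1M
URLs over all partitions. The caveat, as far as I understand, is that
grouping by host is what lets the fetcher stay polite to each host, so
politeness would then have to be enforced some other way:

    public class SpreadByUrlSketch {
        // Hash the whole URL rather than just the host: the URLs then
        // spread roughly evenly over all partitions, at the cost of
        // scattering one host's URLs across many fetch lists.
        static int partitionFor(String url, int numPartitions) {
            return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Alternatively, simply capping the number of partitions at the number of
distinct hosts would avoid the empty files without losing the per-host
grouping.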

E