Besides that, we should probably add some kind of timeout to the URL filter in general.

I think this is overkill. There is already a Hadoop task timeout. Is that not sufficient?

No! What happens is that the URL filter hangs and then the complete task is timed out instead of just skipping that one URL. After 4 retries the whole job is killed and all fetched data are lost, in my case around 5 million URLs each time. :-(
This was the real cause of the problem described on hadoop-dev.

Instead I would suggest going a step further and adding a (configurable) timeout mechanism that skips bad records during reducing in general. Processing such big data and losing everything because of just one bad record is very sad.
As far as I know, Google's MapReduce also skips bad records.
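For illustration, a minimal sketch (in Java) of what such a per-URL timeout could look like: the filter call runs on a separate thread and is abandoned if it does not return in time, so only the offending URL is skipped rather than the whole task. The TimedUrlFilter class, the timeoutMs parameter and the simplified URLFilter interface here are assumptions for the sketch, not the actual Nutch API.

import java.util.concurrent.*;

public class TimedUrlFilter {

  // Hypothetical interface mirroring the URLFilter contract:
  // returns the (possibly rewritten) URL, or null to reject it.
  public interface URLFilter {
    String filter(String url);
  }

  private final URLFilter delegate;
  private final long timeoutMs;   // e.g. read from a config property
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  public TimedUrlFilter(URLFilter delegate, long timeoutMs) {
    this.delegate = delegate;
    this.timeoutMs = timeoutMs;
  }

  /** Returns the filtered URL, or null if the filter rejected it or timed out. */
  public String filter(final String url) {
    Future<String> result = executor.submit(new Callable<String>() {
      public String call() {
        return delegate.filter(url);
      }
    });
    try {
      return result.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Give up on this URL only; a filter that ignores interruption may
      // leave the worker thread stuck, so a real implementation would
      // probably replace the executor at this point.
      result.cancel(true);
      return null;
    } catch (Exception e) {
      return null;   // bad record: skip it instead of failing the task
    }
  }
}

The same idea generalizes to the "skip bad records" suggestion: any record whose processing throws or exceeds the timeout is counted and dropped, rather than letting the task hit the Hadoop task timeout and eventually kill the job.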

Stefan



