Besides that, we should probably add some kind of timeout to the URL filter in general.

I think this is overkill. There is already a Hadoop task timeout. Is that not sufficient?

No! What happens is that the URL filter hangs and then the complete task is timed out instead of just skipping that one URL. After 4 retries the whole job is killed and all fetched data are lost, in my case around 5 million URLs each time. :-(
This was the real cause of the problem described on hadoop-dev.

Instead I would suggest going a step further and adding a (configurable) timeout mechanism that skips bad records during reducing in general. Processing such big data and losing everything because of just one bad record is very sad.
As far as I know, Google's MapReduce also skips bad records.
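For illustration, a minimal sketch (in Java) of what such a per-URL timeout could look like: the filter call runs on a separate thread and is abandoned if it does not return in time, so only the offending URL is skipped rather than the whole task. The TimedUrlFilter class, the timeoutMs parameter and the simplified URLFilter interface here are assumptions for the sketch, not the actual Nutch API.

import java.util.concurrent.*;

public class TimedUrlFilter {

  // Hypothetical interface mirroring the URLFilter contract:
  // returns the (possibly rewritten) URL, or null to reject it.
  public interface URLFilter {
    String filter(String url);
  }

  private final URLFilter delegate;
  private final long timeoutMs;   // e.g. read from a config property
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  public TimedUrlFilter(URLFilter delegate, long timeoutMs) {
    this.delegate = delegate;
    this.timeoutMs = timeoutMs;
  }

  /** Returns the filtered URL, or null if the filter rejected it or timed out. */
  public String filter(final String url) {
    Future<String> result = executor.submit(new Callable<String>() {
      public String call() {
        return delegate.filter(url);
      }
    });
    try {
      return result.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Give up on this URL only; a filter that ignores interruption may
      // leave the worker thread stuck, so a real implementation would
      // probably replace the executor at this point.
      result.cancel(true);
      return null;
    } catch (Exception e) {
      return null;   // bad record: skip it instead of failing the task
    }
  }
}

The same idea generalizes to the "skip bad records" suggestion: any record whose processing throws or exceeds the timeout is counted and dropped, rather than letting the task hit the Hadoop task timeout and eventually kill the job.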

Stefan



