Doug,
Instead I would suggest going a step further by adding a
(configurable) timeout mechanism and skipping bad records
in general.
Processing such big data and losing all of it because of just one
bad record is very sad.
That's a good suggestion. Ideally we could use Thread.interrupt(),
but that won't stop a thread in a tight loop. The only other
option is Thread.stop(), which isn't generally safe. The safest
thing to do is to restart the task in such a way that the bad entry
is skipped.
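
For illustration, a tiny example of why interrupt() is no help
against a tight loop (just a sketch; it only sets a flag that this
loop never checks):

    // Thread.interrupt() merely sets the interrupt flag; a loop that
    // never checks Thread.interrupted() and never blocks ignores it.
    Thread t = new Thread(new Runnable() {
        public void run() {
            long x = 0;
            while (true) {  // no interrupted() check, no blocking call
                x++;        // so this loop spins forever
            }
        }
    });
    t.start();
    t.interrupt();  // sets the flag; the loop keeps running regardless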
Sounds like a lot of overhead, but I agree there is no other choice.
As far as I know, Google's MapReduce skips bad records also.
Yes, the paper says that, when a job fails, they can restart it,
skipping the bad entry. I don't think they skip without restarting
the task.
In Hadoop I think this could correspond to removing the task that
failed and replacing it with two tasks: one whose input split
includes entries before the bad entry, and one whose input split
includes those after.
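
Roughly something like this (a sketch only; ByteRangeSplit,
badOffset and badLength are made-up names, not Hadoop classes):

    // Replace a failed split with two splits surrounding the bad
    // entry. Assumes the split is a simple (file, start, length) byte
    // range and that the bad record's offset and length are known.
    class ByteRangeSplit {
        final String file;
        final long start, length;
        ByteRangeSplit(String file, long start, long length) {
            this.file = file; this.start = start; this.length = length;
        }
    }

    static ByteRangeSplit[] splitAroundBadEntry(ByteRangeSplit failed,
                                                long badOffset,
                                                long badLength) {
        // entries before the bad record
        ByteRangeSplit before = new ByteRangeSplit(
            failed.file, failed.start, badOffset - failed.start);
        // entries after it; the bad record itself is skipped
        long afterStart = badOffset + badLength;
        ByteRangeSplit after = new ByteRangeSplit(
            failed.file, afterStart,
            failed.start + failed.length - afterStart);
        return new ByteRangeSplit[] { before, after };
    }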
It would be very nice if there were some way to reuse the
already processed records and just add a new task that processes the
records from the bad record + 1 to the end of the split.
But determining which entry failed is hard. Unless we report every
single entry processed to the TaskTracker (which would be too
expensive for many map functions), it is hard to know exactly
where things were when the process died.
Something that pops into my mind is splitting the task repeatedly
until we find the one record that fails. Of course this is expensive,
since we may have to process many small tasks.
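
Something like a bisection over the split, e.g. (rough sketch;
runSubTask is a made-up stand-in, not a Hadoop API):

    abstract class BadRecordBisector {
        // Stand-in: re-run entries [first, last]; true iff no failure.
        abstract boolean runSubTask(long first, long last);

        // Narrow down the failing entry by re-running ever smaller
        // sub-ranges; about log2(n) re-runs if the failure is
        // deterministic.
        long findBadEntry(long first, long last) {
            while (first < last) {
                long mid = (first + last) / 2;
                if (runSubTask(first, mid)) {
                    first = mid + 1;  // lower half is clean
                } else {
                    last = mid;       // failure is in [first, mid]
                }
            }
            return first;             // the single failing entry
        }
    }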
We could instead include the number of entries processed in each
status message, and the maximum count of entries before another
status will be sent.
This sounds interesting. We would require some more metadata in the
reporter, but this is scheduled for Hadoop 0.2. In this change I
would also love to see the ability to attach custom metadata to the
report (a MapWritable?).
In combination with a public API that allows access to these task
reports, we could have a kind of lock manager as described in the
BigTable talk.
This way the task child can try to send, e.g., about one report
per second to its parent TaskTracker, and adaptively determine how
many entries to process between reports. So, for the first report it can
guess that it will process only 1 entry before the next report.
Then it processes the first entry and can now estimate how many
entries it can process in the next second, and reports this as the
maximum number of entries before the next report. Then it
processes entries until either the reported max or one second is
exceeded, and then makes its next status report. And so on. If the
child hangs, then one can identify the range of entries it was in,
down to about one second's worth. If each entry takes longer than one
second to process, then we'd know the exact entry.
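
In pseudo-Java, the loop I have in mind looks about like this (a
sketch; moreEntries, processNextEntry and reportStatus are
illustrative stand-ins, not real APIs):

    // One report roughly per second; adaptively estimate how many
    // entries fit between reports.
    abstract class AdaptiveReporter {
        static final long INTERVAL_MS = 1000;  // ~one report per second

        abstract boolean moreEntries();
        abstract void processNextEntry();
        abstract void reportStatus(long processed, long maxBeforeNext);

        void run() {
            long maxEntries = 1;  // first report: guess a single entry
            while (moreEntries()) {
                long count = 0;
                long start = System.currentTimeMillis();
                while (count < maxEntries && moreEntries()
                       && System.currentTimeMillis() - start < INTERVAL_MS) {
                    processNextEntry();
                    count++;
                }
                long elapsed =
                    Math.max(1, System.currentTimeMillis() - start);
                // Estimate entries per second for the next window; if
                // the child hangs, the parent knows the bad entry lies
                // within the last reported window.
                maxEntries = Math.max(1, count * INTERVAL_MS / elapsed);
                reportStatus(count, maxEntries);
            }
        }
    }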
Unfortunately, this would not work with the Nutch Fetcher, which
processes entries in separate threads, not strictly ordered...
Well, it would work for all map and reduce tasks. MapRunnable
implementations can take care of bad records by themselves, since
there we have full access to the record reader.
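
E.g. a MapRunnable could catch per-record failures itself, roughly
like this (a sketch against the later org.apache.hadoop.mapred
signatures; the exact early API differs, and the skip-on-exception
policy is just illustrative):

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class SkippingMapRunner<K1, V1, K2, V2>
            implements MapRunnable<K1, V1, K2, V2> {

        private Mapper<K1, V1, K2, V2> mapper;  // set up in configure()

        public void configure(JobConf job) {
            // instantiate the configured Mapper here (omitted)
        }

        public void run(RecordReader<K1, V1> input,
                        OutputCollector<K2, V2> output,
                        Reporter reporter) throws IOException {
            K1 key = input.createKey();
            V1 value = input.createValue();
            while (input.next(key, value)) {
                try {
                    mapper.map(key, value, output, reporter);
                } catch (Exception e) {
                    // bad record: note it and keep going instead of
                    // failing the whole task
                    reporter.setStatus("skipped bad record at " + key);
                }
            }
        }
    }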
Stefan