Stefan Groschupf wrote:
Instead I would suggest going a step further by adding a (configurable) timeout mechanism and skipping bad records when reducing in general. Processing such big data and losing all of it because of just one bad record is very sad.
That's a good suggestion. Ideally we could use Thread.interrupt(), but that won't stop a thread in a tight loop. The only other option is Thread.stop(), which isn't generally safe. The safest thing to do is to restart the task in such a way that the bad entry is skipped.
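To make that limitation concrete, here is a small stand-alone Java illustration (not Hadoop code): a thread stuck in a CPU-bound loop that never blocks and never checks its interrupted status simply ignores Thread.interrupt().

public class InterruptDemo {
  public static void main(String[] args) throws Exception {
    Thread worker = new Thread(new Runnable() {
      public void run() {
        long x = 0;
        while (true) {       // tight loop: no sleep(), wait(), or I/O,
          x = x * 31 + 1;    // and no check of Thread.interrupted(),
        }                    // so interrupt() only sets a flag we ignore
      }
    });
    worker.setDaemon(true);
    worker.start();

    Thread.sleep(100);
    worker.interrupt();      // sets the interrupt flag...
    Thread.sleep(100);
    System.out.println("worker still alive? " + worker.isAlive());  // true
  }
}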
As far as I know, Google's MapReduce skips bad records also.
Yes, the paper says that, when a job fails, they can restart it, skipping the bad entry. I don't think they skip without restarting the task.
In Hadoop I think this could correspond to removing the task that failed and replacing it with two tasks: one whose input split includes entries before the bad entry, and one whose input split includes those after. Or we could keep a list of bad entry indexes and send these along with the task. I prefer splitting the task.
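As a rough sketch of the task-splitting idea, assuming entry-indexed splits and made-up names (EntrySplit and makeReplacements() are not Hadoop API), the replacement could look something like this:

// Hypothetical: split the failed task's input around the bad entry.
class EntrySplit {
  final String file;
  final long firstEntry;   // index of the first entry in this split
  final long numEntries;   // number of entries covered by this split

  EntrySplit(String file, long firstEntry, long numEntries) {
    this.file = file;
    this.firstEntry = firstEntry;
    this.numEntries = numEntries;
  }

  /** Replace this split with two splits that together skip the bad entry. */
  EntrySplit[] makeReplacements(long badEntry) {
    EntrySplit before =
      new EntrySplit(file, firstEntry, badEntry - firstEntry);
    EntrySplit after =
      new EntrySplit(file, badEntry + 1,
                     firstEntry + numEntries - (badEntry + 1));
    return new EntrySplit[] { before, after };  // either may be empty
  }
}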
But determining which entry failed is hard. Unless we report every single entry processed to the TaskTracker (which would be too expensive for many map functions), it is hard to know exactly where things were when the process died.
We could instead include in each status message the number of entries processed so far, plus the maximum number of entries that will be processed before another status is sent. This way the task child can try to send about one report per second to its parent TaskTracker, adaptively determining how many entries to process between reports. For the first report it can guess that it will process only one entry before the next report. Then it processes the first entry, can now estimate how many entries it can process in the next second, and reports this as the maximum number of entries before the next report. Then it processes entries until either the reported maximum is reached or one second has elapsed, and makes its next status report. And so on. If the child hangs, one can then identify the range of entries it was in down to one second's worth of work. If each entry takes longer than one second to process, we'd know the exact entry.
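Here is a rough sketch of that adaptive reporting loop. The Reporter interface and process() method are placeholders, not real Hadoop classes; the point is the pacing logic: aim for about one report per second and tell the parent, with each report, the most entries that will be processed before the next one.

interface Reporter {
  /** Report entries processed so far and the max before the next report. */
  void report(long entriesProcessed, long maxBeforeNextReport);
}

class AdaptiveReportingRunner {
  static final long REPORT_INTERVAL_MS = 1000;   // target: ~1 report/second

  void run(java.util.Iterator<Object> entries, Reporter reporter) {
    long processed = 0;
    long maxThisPeriod = 1;                      // first guess: 1 entry
    reporter.report(processed, maxThisPeriod);

    long periodStart = System.currentTimeMillis();
    long processedThisPeriod = 0;

    while (entries.hasNext()) {
      process(entries.next());                   // user map/reduce code
      processed++;
      processedThisPeriod++;

      long elapsed = System.currentTimeMillis() - periodStart;
      if (processedThisPeriod >= maxThisPeriod
          || elapsed >= REPORT_INTERVAL_MS) {
        // Estimate how many entries fit in the next second from the rate
        // just observed, and promise no more than that before reporting.
        maxThisPeriod = Math.max(1,
            processedThisPeriod * REPORT_INTERVAL_MS / Math.max(elapsed, 1));
        reporter.report(processed, maxThisPeriod);
        periodStart = System.currentTimeMillis();
        processedThisPeriod = 0;
      }
    }
    reporter.report(processed, 0);               // final report: no more to come
  }

  private void process(Object entry) { /* user-supplied record processing */ }
}

If the child hangs, the last report bounds where it was: somewhere in the next maxThisPeriod entries after the reported count, which by construction is about one second's worth of work.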
Unfortunately, this would not work with the Nutch Fetcher, which processes entries in separate threads, not strictly ordered...
Doug