Stefan Groschupf wrote:
Instead I would suggest going a step further by adding a (configurable) timeout mechanism and skipping bad records when reducing in general. Processing such big data and losing all of it because of just one bad record is very sad.
That's a good suggestion. Ideally we could use Thread.interrupt(), but that won't stop a thread in a tight loop. The only other option is Thread.stop(), which isn't generally safe. The safest thing to do is to restart the task in such a way that the bad entry is skipped.
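To make that limitation concrete, here is a small stand-alone Java illustration (not Hadoop code): a thread stuck in a CPU-bound loop that never blocks and never checks its interrupted status simply ignores Thread.interrupt().

public class InterruptDemo {
  public static void main(String[] args) throws Exception {
    Thread worker = new Thread(new Runnable() {
      public void run() {
        long x = 0;
        while (true) {       // tight loop: no sleep(), wait(), or I/O,
          x = x * 31 + 1;    // and no check of Thread.interrupted(),
        }                    // so interrupt() only sets a flag we ignore
      }
    });
    worker.setDaemon(true);
    worker.start();

    Thread.sleep(100);
    worker.interrupt();      // sets the interrupt flag...
    Thread.sleep(100);
    System.out.println("worker still alive? " + worker.isAlive());  // true
  }
}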
As far as I know, Google's MapReduce skips bad records also.
Yes, the paper says that, when a job fails, they can restart it, skipping the bad entry. I don't think they skip without restarting the task.
In Hadoop I think this could correspond to removing the task that failed and replacing it with two tasks: one whose input split includes entries before the bad entry, and one whose input split includes those after. Or we could keep a list of bad entry indexes and send these along with the task. I prefer splitting the task.
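As a rough sketch of the task-splitting idea, assuming entry-indexed splits and made-up names (EntrySplit and makeReplacements() are not Hadoop API), the replacement could look something like this:

// Hypothetical: split the failed task's input around the bad entry.
class EntrySplit {
  final String file;
  final long firstEntry;   // index of the first entry in this split
  final long numEntries;   // number of entries covered by this split

  EntrySplit(String file, long firstEntry, long numEntries) {
    this.file = file;
    this.firstEntry = firstEntry;
    this.numEntries = numEntries;
  }

  /** Replace this split with two splits that together skip the bad entry. */
  EntrySplit[] makeReplacements(long badEntry) {
    EntrySplit before =
      new EntrySplit(file, firstEntry, badEntry - firstEntry);
    EntrySplit after =
      new EntrySplit(file, badEntry + 1,
                     firstEntry + numEntries - (badEntry + 1));
    return new EntrySplit[] { before, after };  // either may be empty
  }
}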
But determining which entry failed is hard. Unless we report every single entry processed to the TaskTracker (which would be too expensive for many map functions), it is hard to know exactly where things were when the process died.
We could instead include in each status message the number of entries processed so far, plus the maximum number of entries that will be processed before another status is sent. This way the task child can try to send about one report per second to its parent TaskTracker, adaptively determining how many entries to process between reports. For the first report it can guess that it will process only one entry before the next report. Then it processes the first entry, can now estimate how many entries it can process in the next second, and reports this as the maximum number of entries before the next report. Then it processes entries until either the reported maximum is reached or one second has elapsed, and makes its next status report. And so on. If the child hangs, one can then identify the range of entries it was in down to one second's worth of work. If each entry takes longer than one second to process, we'd know the exact entry.
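Here is a rough sketch of that adaptive reporting loop. The Reporter interface and process() method are placeholders, not real Hadoop classes; the point is the pacing logic: aim for about one report per second and tell the parent, with each report, the most entries that will be processed before the next one.

interface Reporter {
  /** Report entries processed so far and the max before the next report. */
  void report(long entriesProcessed, long maxBeforeNextReport);
}

class AdaptiveReportingRunner {
  static final long REPORT_INTERVAL_MS = 1000;   // target: ~1 report/second

  void run(java.util.Iterator<Object> entries, Reporter reporter) {
    long processed = 0;
    long maxThisPeriod = 1;                      // first guess: 1 entry
    reporter.report(processed, maxThisPeriod);

    long periodStart = System.currentTimeMillis();
    long processedThisPeriod = 0;

    while (entries.hasNext()) {
      process(entries.next());                   // user map/reduce code
      processed++;
      processedThisPeriod++;

      long elapsed = System.currentTimeMillis() - periodStart;
      if (processedThisPeriod >= maxThisPeriod
          || elapsed >= REPORT_INTERVAL_MS) {
        // Estimate how many entries fit in the next second from the rate
        // just observed, and promise no more than that before reporting.
        maxThisPeriod = Math.max(1,
            processedThisPeriod * REPORT_INTERVAL_MS / Math.max(elapsed, 1));
        reporter.report(processed, maxThisPeriod);
        periodStart = System.currentTimeMillis();
        processedThisPeriod = 0;
      }
    }
    reporter.report(processed, 0);               // final report: no more to come
  }

  private void process(Object entry) { /* user-supplied record processing */ }
}

If the child hangs, the last report bounds where it was: somewhere in the next maxThisPeriod entries after the reported count, which by construction is about one second's worth of work.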
Unfortunately, this would not work with the Nutch Fetcher, which processes entries in separate threads, not strictly ordered...
Doug