Doug,
Instead I would suggest going a step further by adding a
(configurable) timeout mechanism and skipping bad records
in general.
Processing such big data and losing all of it because of just one
bad record is very sad.
That's a good suggestion. Ideally we could use Thread.interrupt(),
but that won't stop a thread in a tight loop. The only other
option is Thread.stop(), which isn't generally safe. The safest
thing to do is to restart the task in such a way that the bad entry
is skipped.
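
For illustration, a tiny example of why interrupt() is no help
against a tight loop (just a sketch; it only sets a flag that this
loop never checks):

    // Thread.interrupt() merely sets the interrupt flag; a loop that
    // never checks Thread.interrupted() and never blocks ignores it.
    Thread t = new Thread(new Runnable() {
        public void run() {
            long x = 0;
            while (true) {  // no interrupted() check, no blocking call
                x++;        // so this loop spins forever
            }
        }
    });
    t.start();
    t.interrupt();  // sets the flag; the loop keeps running regardless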
Sounds like a lot of overhead, but I agree there is no other choice.
As far as I know, Google's MapReduce skips bad records also.
Yes, the paper says that, when a job fails, they can restart it,
skipping the bad entry. I don't think they skip without restarting
the task.
In Hadoop I think this could correspond to removing the task that
failed and replacing it with two tasks: one whose input split
includes entries before the bad entry, and one whose input split
includes those after.
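
Roughly something like this (a sketch only; ByteRangeSplit,
badOffset and badLength are made-up names, not Hadoop classes):

    // Replace a failed split with two splits surrounding the bad
    // entry. Assumes the split is a simple (file, start, length) byte
    // range and that the bad record's offset and length are known.
    class ByteRangeSplit {
        final String file;
        final long start, length;
        ByteRangeSplit(String file, long start, long length) {
            this.file = file; this.start = start; this.length = length;
        }
    }

    static ByteRangeSplit[] splitAroundBadEntry(ByteRangeSplit failed,
                                                long badOffset,
                                                long badLength) {
        // entries before the bad record
        ByteRangeSplit before = new ByteRangeSplit(
            failed.file, failed.start, badOffset - failed.start);
        // entries after it; the bad record itself is skipped
        long afterStart = badOffset + badLength;
        ByteRangeSplit after = new ByteRangeSplit(
            failed.file, afterStart,
            failed.start + failed.length - afterStart);
        return new ByteRangeSplit[] { before, after };
    }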
It would be very nice if there were some way to reuse the
already processed records and just add a new task that processes the
records from the bad record + 1 to the end of the split.
But determining which entry failed is hard. Unless we report every
single entry processed to the TaskTracker (which would be too
expensive for many map functions), it is hard to know exactly
where things were when the process died.
Something that pops into my mind is splitting the task repeatedly
until we find the one record that fails. Of course this is expensive,
since we may have to process many small tasks.
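
Something like a bisection over the split, e.g. (rough sketch;
runSubTask is a made-up stand-in, not a Hadoop API):

    abstract class BadRecordBisector {
        // Stand-in: re-run entries [first, last]; true iff no failure.
        abstract boolean runSubTask(long first, long last);

        // Narrow down the failing entry by re-running ever smaller
        // sub-ranges; about log2(n) re-runs if the failure is
        // deterministic.
        long findBadEntry(long first, long last) {
            while (first < last) {
                long mid = (first + last) / 2;
                if (runSubTask(first, mid)) {
                    first = mid + 1;  // lower half is clean
                } else {
                    last = mid;       // failure is in [first, mid]
                }
            }
            return first;             // the single failing entry
        }
    }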
We could instead include the number of entries processed in each
status message, and the maximum count of entries before another
status will be sent.
This sounds interesting. We would require some more metadata in the
reporter, but this is scheduled for Hadoop 0.2. In this change I
would also love to see the ability to attach custom metadata to the
report (a MapWritable?).
In combination with a public API that allows access to these task
reports, we could have a kind of lock manager as described in the
BigTable talk.
This way the task child can try to send, e.g., about one report
per second to its parent TaskTracker, and adaptively determine how
many entries to process between reports. So, for the first report it can
guess that it will process only 1 entry before the next report.
Then it processes the first entry and can now estimate how many
entries it can process in the next second, and reports this as the
maximum number of entries before the next report. Then it
processes entries until either the reported max or one second is
exceeded, and then makes its next status report. And so on. If the
child hangs, then one can identify the range of entries it was in,
down to about one second's worth. If each entry takes longer than one
second to process, then we'd know the exact entry.
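
In pseudo-Java, the loop I have in mind looks about like this (a
sketch; moreEntries, processNextEntry and reportStatus are
illustrative stand-ins, not real APIs):

    // One report roughly per second; adaptively estimate how many
    // entries fit between reports.
    abstract class AdaptiveReporter {
        static final long INTERVAL_MS = 1000;  // ~one report per second

        abstract boolean moreEntries();
        abstract void processNextEntry();
        abstract void reportStatus(long processed, long maxBeforeNext);

        void run() {
            long maxEntries = 1;  // first report: guess a single entry
            while (moreEntries()) {
                long count = 0;
                long start = System.currentTimeMillis();
                while (count < maxEntries && moreEntries()
                       && System.currentTimeMillis() - start < INTERVAL_MS) {
                    processNextEntry();
                    count++;
                }
                long elapsed =
                    Math.max(1, System.currentTimeMillis() - start);
                // Estimate entries per second for the next window; if
                // the child hangs, the parent knows the bad entry lies
                // within the last reported window.
                maxEntries = Math.max(1, count * INTERVAL_MS / elapsed);
                reportStatus(count, maxEntries);
            }
        }
    }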
Unfortunately, this would not work with the Nutch Fetcher, which
processes entries in separate threads, not strictly ordered...
Well, it would work for all map and reduce tasks. MapRunnable
implementations can take care of bad records by themselves, since
there we have full access to the record reader.
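
E.g. a MapRunnable could catch per-record failures itself, roughly
like this (a sketch against the later org.apache.hadoop.mapred
signatures; the exact early API differs, and the skip-on-exception
policy is just illustrative):

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class SkippingMapRunner<K1, V1, K2, V2>
            implements MapRunnable<K1, V1, K2, V2> {

        private Mapper<K1, V1, K2, V2> mapper;  // set up in configure()

        public void configure(JobConf job) {
            // instantiate the configured Mapper here (omitted)
        }

        public void run(RecordReader<K1, V1> input,
                        OutputCollector<K2, V2> output,
                        Reporter reporter) throws IOException {
            K1 key = input.createKey();
            V1 value = input.createValue();
            while (input.next(key, value)) {
                try {
                    mapper.map(key, value, output, reporter);
                } catch (Exception e) {
                    // bad record: note it and keep going instead of
                    // failing the whole task
                    reporter.setStatus("skipped bad record at " + key);
                }
            }
        }
    }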
Stefan