Yes, it is not stable. But is it better when a user has to look through n task logs to find an exception that happened in a single task, one that causes the whole job to hang and wastes resources across the cluster?
We do not have to implement full fault tolerance in 0.4.0. But it would be a good start to propagate exceptions to the grooms and have them report back to the master. If the task really went down, the job can be killed. This can be made configurable, so that we do not ship with an "unstable" framework. But for production use cases this is the better way.

On 28.01.2012 at 02:46, "Edward J. Yoon" <[email protected]> wrote:
