On 12 Feb 2018, at 20:21, Ryan Blue
I wouldn't say we have a single primary failure mode that we deal with. What we
concluded was that none of the schemes we came up with to avoid corruption could
cover all cases. For example, what about when the memory holding a value is
corrupted just before it is handed off to the writer?
That's why we track down the source of the corruption and remove it from our
clusters and let Amazon know to remove the instance from the hardware pool. We
also structure our ETL so we have some time to reprocess.
I could remove memory/disk buffering of the blocks as a source of corruption,
leaving only working-memory failures that somehow get past ECC, or bus errors
of some form.
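The point about ruling out buffering as a corruption source can be sketched as checksum-on-ingest / verify-before-write. This is a minimal illustration of the idea, not Hadoop's actual implementation; the function names and the in-memory `sink` are hypothetical:

```python
import zlib


def checksum(block: bytes) -> int:
    # CRC32 over the raw bytes as they first enter the pipeline.
    return zlib.crc32(block)


def write_with_verification(block: bytes, expected_crc: int, sink: list) -> None:
    # Re-check the block just before hand-off to the writer. If the
    # buffered copy was corrupted in memory or on disk after ingest,
    # fail loudly instead of persisting bad data.
    if zlib.crc32(block) != expected_crc:
        raise IOError("block corrupted between ingest and write")
    sink.append(block)


block = b"example record batch"
crc = checksum(block)  # computed at ingest time, before any buffering
sink: list = []
write_with_verification(block, crc, sink)  # succeeds: block is unchanged
```

Note the limitation this email describes: a value corrupted *before* the first checksum is taken, or in the instant between verification and the actual write, still slips through, which is why detection and removal of bad hardware remains necessary.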
Filed https://issues.apache.org/jira/browse/HADOOP-15224 to add this to the
to-do list, targeting Hadoop >= 3.2.