I wouldn't say there is one primary failure mode that we deal with. What we
concluded was that none of the schemes we came up with to avoid corruption
could cover every case. For example, what about when the memory holding a
value is corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption, remove it from our
clusters, and let Amazon know to remove the instance from the hardware
pool. We also structure our ETL so we have some time to reprocess.
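
A minimal sketch of the read-back check discussed in this thread, assuming
plain Python and local files rather than Spark, HDFS, or S3. The names
write_and_verify and sha256_of_file are hypothetical and not part of any
library. It also illustrates the limit described above: the digest is taken
from whatever bytes reach the writer, so a value corrupted in memory before
that point is hashed as if it were valid.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file back from disk and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(payload, path):
    """Write payload to path, re-read it, and report whether the bytes match.

    This only catches corruption introduced after the digest of `payload`
    is computed (for example by a bad local disk). It cannot catch bytes
    that were already wrong in memory when this function was called.
    """
    expected = hashlib.sha256(payload).hexdigest()
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the device
    return sha256_of_file(path) == expected
```

A caller would publish the file only when write_and_verify returns True, and
otherwise record which host produced the mismatch and reprocess. Note that a
read immediately after a write may be served from the OS page cache, so a
passing check is weaker evidence than a failing one.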


On Mon, Feb 12, 2018 at 11:49 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> On 12 Feb 2018, at 19:35, Dong Jiang <dji...@dataxu.com> wrote:
>
>> I got no error messages from EMR. We write directly from dataframe to S3.
>> There doesn’t appear to be an issue with the S3 file; we can still download
>> the parquet file and read most of the columns, just one column is corrupted
>> in parquet.
>> I suspect we need to write to HDFS first, make sure we can read back the
>> entire data set, and then copy from HDFS to S3. Any other thoughts?
>
> The S3 object store clients mostly buffer to the local temp fs before they
> write, at least all the ASF connectors do, so that data can be PUT/POSTed
> in 5+MB blocks without requiring enough heap to buffer all data written by
> all threads. That's done to file://, not HDFS. Even if you do that copy up
> later from HDFS to S3, there's still going to be that local HDD buffering:
> it's not going to fix the problem, not if this really is corrupted local
> HDD data.

Ryan Blue
Software Engineer
