Asif created PARQUET-2454:
-----------------------------
Summary: Invoking flush before closing the output stream in
ParquetFileWriter
Key: PARQUET-2454
URL: https://issues.apache.org/jira/browse/PARQUET-2454
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Affects Versions: cpp-15.0.0
Reporter: Asif
Fix For: 1.10.2
It has been observed in customer deployments that, sporadically, an
"INSERT OVERWRITE" run through Spark produces an invalid / corrupted
Parquet file. No exceptions are thrown, the writing tasks commit
successfully, and shutdown is graceful once all the tasks are done.
However, when the written files are read back, data corruption occurs;
analysis shows the error "Expected 15356 uncompressed bytes but got 15108",
a deficit of 248 bytes.
Given the low frequency of occurrence, the suspicion is that the output
stream is closed before the buffered data has been fully flushed.
The suggestion is therefore to add a flush() call between writing the footer
and closing the stream, in the end() method of
*org.apache.parquet.hadoop.ParquetFileWriter*
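A minimal sketch of the proposed ordering, using a plain java.io buffered
stream rather than the actual ParquetFileWriter internals (the class name
FlushBeforeCloseSketch and the writeFooterAndClose method are hypothetical,
for illustration only; the real end() method writes the footer bytes before
this point):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class FlushBeforeCloseSketch {

    // Hypothetical stand-in for the tail of ParquetFileWriter.end():
    // after the footer is written, explicitly flush buffered bytes
    // BEFORE closing the underlying stream, so no data can be lost
    // if close() does not propagate the buffer correctly.
    static byte[] writeFooterAndClose(byte[] footer) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8192);
        out.write(footer);
        out.flush();   // proposed change: force buffered bytes to the sink
        out.close();   // then close as before
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 15356 bytes mirrors the expected size from the reported error.
        byte[] payload = new byte[15356];
        byte[] written = writeFooterAndClose(payload);
        if (written.length != payload.length) {
            throw new AssertionError("short write: " + written.length);
        }
        System.out.println("wrote " + written.length + " bytes");
    }
}
```

With a well-behaved stream, close() already implies a flush, so the extra
call is a defensive no-op; the point of the change is to guard against
output-stream implementations where that guarantee does not hold.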
--
This message was sent by Atlassian Jira
(v8.20.10#820010)