Asif created PARQUET-2454:
-----------------------------

             Summary: Invoking flush before closing the output stream in 
ParquetFileWriter
                 Key: PARQUET-2454
                 URL: https://issues.apache.org/jira/browse/PARQUET-2454
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
    Affects Versions: cpp-15.0.0
            Reporter: Asif
             Fix For: 1.10.2


It has been observed sporadically in customer deployments that a Spark 
"INSERT OVERWRITE" generates an invalid / corrupted Parquet file. No 
exceptions are raised, the writing tasks commit successfully, and shutdown 
is graceful once all the tasks are done.

However, when the written files are read back, data corruption occurs. 
Analysis shows the error "Expected 15356 uncompressed bytes but got 15108", 
a deficit of 248 bytes.

Given the low frequency of occurrence, the suspicion is that the output 
stream is closed before the buffered data has been fully flushed.

The suggestion is therefore to add a flush() call between writing the footer 
and closing the stream, in the end() method of

*org.apache.parquet.hadoop.ParquetFileWriter*
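To illustrate the general pattern being proposed (this is a standalone sketch, 
not the actual ParquetFileWriter code; the writeWithFlush helper and the byte 
sizes are made up for illustration): an explicit flush() after the final write 
guarantees that bytes still sitting in a buffering wrapper reach the underlying 
sink before close() runs, rather than relying on close() to flush implicitly.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class FlushBeforeClose {

    // Hypothetical helper standing in for the tail of ParquetFileWriter.end():
    // write the last bytes (the footer), flush explicitly, then close.
    static byte[] writeWithFlush(byte[] footer) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8192);
        out.write(footer);  // stands in for serializing the Parquet footer
        out.flush();        // proposed extra step: push buffered bytes to the sink
        out.close();        // close only after everything has reached the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 15356 bytes, matching the "expected uncompressed bytes" in the report
        byte[] written = writeWithFlush(new byte[15356]);
        System.out.println(written.length); // prints 15356
    }
}
```

With java.io streams close() also flushes, but a filesystem-specific output 
stream that loses buffered bytes on close would produce exactly the kind of 
short-by-a-few-hundred-bytes file described above, which is why an explicit 
flush() before close() is a cheap defensive measure.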



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
