fpetersen-gl opened a new issue, #3254:
URL: https://github.com/apache/parquet-java/issues/3254

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Related to apache/iceberg#13508 and possibly to #1971.
   
   We're using parquet-java 1.15.2 as part of Iceberg, uploading data to S3. The data is flushed to storage at configurable intervals.
   
   ## Description
   If a short network interruption happens exactly while writing and uploading files, the [`ParquetFileWriter`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java) has already transitioned into the state `ENDED`, even though the file has not been written successfully.
   A subsequent attempt to close the (Iceberg) writer then fails with an exception from the `ParquetFileWriter`, which reports being in an invalid state.
   
   Stacktrace:
   ```
   java.io.UncheckedIOException: Failed to flush row group
        at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:225)
        at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:257)
        at org.apache.iceberg.io.DataWriter.close(DataWriter.java:82)
        at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:126)
        at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:156)
        at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
        at org.apache.iceberg.io.FanoutWriter.closeWriters(FanoutWriter.java:82)
        at org.apache.iceberg.io.FanoutWriter.close(FanoutWriter.java:74)
        at org.apache.iceberg.io.FanoutDataWriter.close(FanoutDataWriter.java:31)
        at org.apache.iceberg.parquet.TestParquetWriter.testParquetWriterWithFailingIO(TestParquetWriter.java:113)
   [... Junit/JDK classes ...]
   Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: ENDED
        at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:250)
        at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:224)
        at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:586)
        at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:215)
        ... 100 more
   ```
   
   ## Possible Solution
   Every time the internal field `state` is updated, the transition happens at the very beginning of the method, which is too early: if the method's code throws an exception later on, not all of its logic has executed successfully, yet the writer is left in a state that claims it has.
   Simply moving the state transition to the end of the method is not enough either, because a retry mechanism might call the method multiple times, and executing the method's logic more than once must also be avoided.
   One possibility would be to introduce additional internal states. This would track the writer's progress in more detail, which in turn would make it more resilient to retries.
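   To illustrate the idea, here is a minimal sketch of such a more fine-grained state machine. All names here (`RetryableWriter`, `ENDING`, `flushFooter`) are hypothetical and not the actual `ParquetFileWriter` API: the writer enters an intermediate state before the I/O, and only commits the terminal `ENDED` state after the flush has actually succeeded, so a failed close can be retried and a close after success is a no-op.

   ```java
   // Hypothetical sketch, not the real ParquetFileWriter: the terminal
   // state is only committed after the underlying I/O has succeeded.
   class RetryableWriter {
       enum State { STARTED, ENDING, ENDED }

       private State state = State.STARTED;
       private final Runnable flushFooter; // stand-in for the real upload/flush

       RetryableWriter(Runnable flushFooter) {
           this.flushFooter = flushFooter;
       }

       void end() {
           if (state == State.ENDED) {
               return; // already finished successfully; a retry is a no-op
           }
           state = State.ENDING; // intermediate state, not yet terminal
           flushFooter.run();    // may throw, e.g. on a network interruption
           state = State.ENDED;  // only reached when the flush succeeded
       }

       State state() {
           return state;
       }
   }
   ```

   With this shape, a network failure during `end()` leaves the writer in `ENDING` rather than `ENDED`, so a retrying caller can invoke `end()` again instead of hitting the invalid-state check.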
   
   ### Component(s)
   
   Core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
