Zand100 opened a new issue, #3069: URL: https://github.com/apache/parquet-java/issues/3069
### Describe the bug, including details regarding any error messages, version, and platform.

Sometimes a file is written that is missing the last byte, so it ends in `.PAR` when it should be `.PAR1`. This causes `EOFException` when attempting to read the file.

```
$ hexdump -C good.snappy.parquet| tail -n 10
004fff70  6b 2e 6c 65 67 61 63 79  44 61 74 65 54 69 6d 65  |k.legacyDateTime|
004fff80  18 00 00 18 4a 70 61 72  71 75 65 74 2d 6d 72 20  |....Jparquet-mr |
004fff90  76 65 72 73 69 6f 6e 20  31 2e 31 32 2e 33 20 28  |version 1.12.3 (|
004fffa0  62 75 69 6c 64 20 66 38  64 63 65 64 31 38 32 63  |build f8dced182c|
004fffb0  34 63 31 66 62 64 65 63  36 63 63 62 33 31 38 35  |4c1fbdec6ccb3185|
004fffc0  35 33 37 62 35 61 30 31  65 36 65 64 36 62 29 19  |537b5a01e6ed6b).|
004fffd0  dc 1c 00 00 1c 00 00 1c  00 00 1c 00 00 1c 00 00  |................|
004fffe0  1c 00 00 1c 00 00 1c 00  00 1c 00 00 1c 00 00 1c  |................|
004ffff0  00 00 1c 00 00 1c 00 00  00 e7 0f 00 00 50 41 52  |.............PAR|
00500000
```

This might be related - we are seeing this issue only on GCP, not AWS. For GCP we do disk seeks randomly and on AWS we do disk seeks sequentially.

We can rerun a job that writes the corrupt parquet file, and it will succeed the second time, so it seems to be nondeterministic.

This is on version 1.14.3.

### Component(s)

_No response_
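
As a possible way to detect the truncated files described above before a reader hits `EOFException`, here is a minimal Python sketch that checks whether a file ends with the 4-byte `PAR1` magic. It is not part of parquet-java. The function name `has_parquet_trailer` is hypothetical, and the check assumes a plaintext (unencrypted) footer and a file on a local path.

```python
import os

# Magic bytes that close a Parquet file with a plaintext footer.
PARQUET_MAGIC = b"PAR1"


def has_parquet_trailer(path):
    """Return True if the file at `path` ends with the PAR1 magic bytes.

    Hypothetical helper for illustration only: it inspects just the last
    four bytes and does not validate the footer length or metadata.
    """
    size = os.path.getsize(path)
    if size < len(PARQUET_MAGIC):
        # Too short to contain the trailing magic at all.
        return False
    with open(path, "rb") as f:
        # Seek to four bytes before the end and compare with the magic.
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

Running such a check right after a write job finishes could flag a file that ends in `PAR` instead of `PAR1`, so the job can be retried before any downstream read fails.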
