Zand100 opened a new issue, #3069:
URL: https://github.com/apache/parquet-java/issues/3069

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Sometimes a file is written without its last byte, so it ends in `PAR` when it should end in the `PAR1` magic bytes. Reading such a file fails with an `EOFException`.
   
   ```
   $ hexdump -C good.snappy.parquet | tail -n 10
   004fff70  6b 2e 6c 65 67 61 63 79  44 61 74 65 54 69 6d 65  |k.legacyDateTime|
   004fff80  18 00 00 18 4a 70 61 72  71 75 65 74 2d 6d 72 20  |....Jparquet-mr |
   004fff90  76 65 72 73 69 6f 6e 20  31 2e 31 32 2e 33 20 28  |version 1.12.3 (|
   004fffa0  62 75 69 6c 64 20 66 38  64 63 65 64 31 38 32 63  |build f8dced182c|
   004fffb0  34 63 31 66 62 64 65 63  36 63 63 62 33 31 38 35  |4c1fbdec6ccb3185|
   004fffc0  35 33 37 62 35 61 30 31  65 36 65 64 36 62 29 19  |537b5a01e6ed6b).|
   004fffd0  dc 1c 00 00 1c 00 00 1c  00 00 1c 00 00 1c 00 00  |................|
   004fffe0  1c 00 00 1c 00 00 1c 00  00 1c 00 00 1c 00 00 1c  |................|
   004ffff0  00 00 1c 00 00 1c 00 00  00 e7 0f 00 00 50 41 52  |.............PAR|
   00500000
   ```
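
For anyone who wants to scan a batch of output files for this truncation without reading hexdumps, here is a minimal sketch in Python. It only inspects the last four bytes of each file. The helper name `footer_status` and its three return labels are hypothetical and not part of parquet-java. It also assumes files with a plaintext footer, which end in `PAR1`.

```python
import os

# Magic bytes that end a Parquet file with a plaintext footer.
PARQUET_MAGIC = b"PAR1"


def footer_status(path):
    """Classify the tail of a file as 'ok', 'truncated', or 'invalid'.

    Hypothetical helper: it checks only the trailing magic bytes and does
    not validate the footer length or the metadata itself.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read at most the last 4 bytes of the file.
        f.seek(max(0, size - len(PARQUET_MAGIC)))
        tail = f.read()
    if tail == PARQUET_MAGIC:
        return "ok"
    # A tail that ends in a proper prefix of the magic (for example b"PAR")
    # suggests the final byte(s) were never written.
    for n in range(len(PARQUET_MAGIC) - 1, 0, -1):
        if tail.endswith(PARQUET_MAGIC[:n]):
            return "truncated"
    return "invalid"
```

Under these assumptions, a file whose tail matches the dump above (`... 50 41 52`) would be reported as `truncated`. The prefix match is a heuristic, so a `truncated` result is a hint to inspect the file rather than proof of this bug.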
   
   This might be related: we are seeing this issue only on GCP, not AWS. On GCP our disk seeks are random, while on AWS they are sequential.
   
   When we rerun a job that wrote a corrupt Parquet file, it succeeds the second time, so the failure appears to be nondeterministic.
   
   This is on version 1.14.3.
   
   ### Component(s)
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

