asfimport commented on issue #42298:
URL: https://github.com/apache/arrow/issues/42298#issuecomment-2184204672

   [Wes McKinney](https://issues.apache.org/jira/browse/PARQUET-816?#comment-15773286) / @wesm:
   @mrocklin I tracked down the source of this bug. 
   
   There's a bug in parquet-mr 1.2.8 and lower in which the column chunk metadata in the Parquet file is incorrect. Impala inserted an explicit workaround for this (see https://github.com/apache/incubator-impala/blob/88448d1d4ab31eaaf82f764b36dc7d11d4c63c32/be/src/exec/hdfs-parquet-scanner.cc#L1227). You didn't hit this bug in the fastparquet Python implementation because fastparquet doesn't use the `total_compressed_size` field to read the entire column chunk into memory before decoding begins.
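   
   A reader that does rely on `total_compressed_size` has to compensate for the under-count. Below is a minimal Python sketch of that Impala-style padding workaround; `ChunkMeta`, `read_column_chunk`, and the padding constant are hypothetical illustrations, not the actual Impala or parquet-cpp code.
   
   ```python
   import io
   from dataclasses import dataclass
   
   
   @dataclass
   class ChunkMeta:
       """Hypothetical stand-in for the footer's column chunk metadata."""
       file_offset: int
       total_compressed_size: int
       has_dictionary_page: bool
   
   
   # Generous upper bound on a dictionary page header size; an assumed
   # constant, not the value Impala actually uses.
   DICT_PAGE_HEADER_PAD = 100
   
   
   def read_column_chunk(source, meta: ChunkMeta) -> bytes:
       size = meta.total_compressed_size
       if meta.has_dictionary_page:
           # parquet-mr <= 1.2.8 left the dictionary page header out of
           # total_compressed_size, so read a little extra; the page
           # decoder stops after the last data page regardless.
           size += DICT_PAGE_HEADER_PAD
       source.seek(meta.file_offset)
       return source.read(size)
   
   
   # Toy usage: a 337-byte chunk whose metadata claims only 322 bytes.
   buf = io.BytesIO(b"\x00" * 337)
   meta = ChunkMeta(file_offset=0, total_compressed_size=322,
                    has_dictionary_page=True)
   assert len(read_column_chunk(buf, meta)) == 337  # padding covers the gap
   ```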
   
   In this particular file, the dictionary page header is 15 bytes, and the entire column chunk is:
   
   15 (dict page header) + 277 (dictionary) + 17 (data page header) + 28 (data page) = 337 bytes.
   
   But the metadata says the column chunk is only 322 bytes: the dict page header size got dropped from the accounting.
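   
   That under-count is exactly the dictionary page header, as a quick check confirms (the figures are the ones above; the variable names are just labels):
   
   ```python
   dict_page_header = 15   # bytes
   dictionary = 277
   data_page_header = 17
   data_page = 28
   
   actual = dict_page_header + dictionary + data_page_header + data_page
   assert actual == 337  # true on-disk size of the column chunk
   
   # parquet-mr <= 1.2.8 writes the total without the dict page header:
   reported = actual - dict_page_header
   assert reported == 322  # the value stored in the column chunk metadata
   ```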

