asfimport commented on issue #42298: URL: https://github.com/apache/arrow/issues/42298#issuecomment-2184204672
[Wes McKinney](https://issues.apache.org/jira/browse/PARQUET-816?#comment-15773286) / @wesm: @mrocklin I tracked down the source of this bug. In parquet-mr 1.2.8 and lower, the column chunk metadata written to the Parquet file is incorrect. Impala inserted an explicit workaround for this (see https://github.com/apache/incubator-impala/blob/88448d1d4ab31eaaf82f764b36dc7d11d4c63c32/be/src/exec/hdfs-parquet-scanner.cc#L1227). You didn't hit this bug in the fastparquet Python implementation because fastparquet doesn't use the `total_compressed_size` field to read the entire column chunk into memory before beginning decoding.

In this particular file, the dictionary page header is 15 bytes, and the entire column chunk is 15 (dict page header) + 277 (dictionary) + 17 (data page header) + 28 (data page) = 337 bytes. But the metadata says the column chunk is only 322 bytes: the dictionary page header size was dropped from the accounting.
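For anyone reading along, here is a minimal Python sketch of the kind of workaround Impala applies (this is illustrative, not the actual parquet-cpp or Impala code): when a column chunk in a file written by parquet-mr 1.2.8 or lower starts with a dictionary page, pad the read past the reported `total_compressed_size` so the dictionary page header isn't cut off. The function name and the 15-byte default are assumptions for this example; real page headers are variable-length Thrift structs.

```python
# Illustrative sketch only -- not the actual parquet-cpp/Impala code.
# Files written by parquet-mr <= 1.2.8 report a total_compressed_size
# that omits the dictionary page header, so a reader that trusts the
# metadata must read extra bytes when the chunk has a dictionary page.

def column_chunk_read_size(total_compressed_size: int,
                           has_dictionary_page: bool,
                           written_by_buggy_parquet_mr: bool,
                           dict_page_header_size: int = 15) -> int:
    """Return how many bytes to read for a column chunk.

    `written_by_buggy_parquet_mr` would be derived from the file's
    created_by metadata. `dict_page_header_size` defaults to 15 only
    because that is the header size in this specific file; headers are
    variable-length, so a real reader would over-read conservatively.
    """
    size = total_compressed_size
    if has_dictionary_page and written_by_buggy_parquet_mr:
        # The dict page header was dropped from the metadata's
        # accounting; compensate so decoding doesn't run out of bytes.
        size += dict_page_header_size
    return size

# For the file above: 322 (metadata) + 15 (dict page header) = 337 bytes,
# matching the 15 + 277 + 17 + 28 actually on disk.
assert column_chunk_read_size(322, True, True) == 337
```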
