Eunsoo Roh created PARQUET-68:
---------------------------------

             Summary: Incompatible behavior for ColumnChunk.file_offset between Parquet-mr and Impala
                 Key: PARQUET-68
                 URL: https://issues.apache.org/jira/browse/PARQUET-68
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format, parquet-mr
            Reporter: Eunsoo Roh


According to the comments in 
[parquet.thrift|https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift#L479],
 this field is supposed to store the offset of the ColumnMetaData within the 
file in which the column chunk is stored. My understanding is that this allows 
the ColumnMetaData to be omitted from the ColumnChunk (it is an optional field, 
after all). Unfortunately, the two major implementations, Parquet-mr and 
Impala, both deviate from this definition when writing Parquet files. The 
Impala implementation writes an offset pointing to the ColumnChunk rather than 
to the ColumnMetaData, as can be seen in 
[hdfs-parquet-table-writer.cc|https://github.com/cloudera/Impala/blob/24db37f4efdc493d218470dc045b61f5104c4fd0/be/src/exec/hdfs-parquet-table-writer.cc#L895].
 While this is still incorrect according to the comments in parquet.thrift, it 
does at least allow access to the ColumnMetaData needed to read the data.

The Parquet-mr implementation can be found in 
[ParquetMetadataConverter|https://github.com/Parquet/parquet-mr/blob/fd8d18f26af9ad7813dda71352b5dcb0080306eb/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L149],
 which writes the offset of the first data page. Not only is this incompatible 
behavior, it also makes no sense, because you cannot read the data with just 
the data page offset. There is even a comment on that line saying "verify this 
is the right offset."
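
To make the disagreement concrete, here is a minimal Python sketch of the three 
interpretations of file_offset described above. It does no real Parquet I/O. 
The class ToyChunkLayout, the helper functions, and all byte positions are 
hypothetical and chosen only for illustration; they are not taken from 
Parquet-mr, Impala, or any real file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyChunkLayout:
    """Hypothetical byte positions for one column chunk (illustration only)."""
    first_data_page_offset: int   # where the first data page starts
    column_chunk_offset: int      # where the serialized ColumnChunk starts
    column_metadata_offset: int   # where the serialized ColumnMetaData starts


def file_offset_candidates(layout):
    """Return the value each interpretation would store in file_offset."""
    return {
        # What the comment in parquet.thrift asks for.
        "parquet.thrift comment": layout.column_metadata_offset,
        # What this report says Impala writes.
        "Impala (as reported)": layout.column_chunk_offset,
        # What this report says Parquet-mr writes.
        "Parquet-mr (as reported)": layout.first_data_page_offset,
    }


def writers_agree(layout):
    """True only if all three interpretations give the same offset."""
    return len(set(file_offset_candidates(layout).values())) == 1


# Made-up positions: pages first, then a ColumnChunk that embeds ColumnMetaData.
layout = ToyChunkLayout(
    first_data_page_offset=4,
    column_chunk_offset=1024,
    column_metadata_offset=1040,
)

for source, offset in file_offset_candidates(layout).items():
    print(f"{source}: file_offset = {offset}")
print("writers agree:", writers_agree(layout))
```

With any layout in which the three positions differ, a reader that trusts 
file_offset according to one interpretation will seek to the wrong place in a 
file produced under another.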



--
This message was sent by Atlassian JIRA
(v6.2#6252)
