[ 
https://issues.apache.org/jira/browse/PARQUET-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561159#comment-14561159
 ] 

Konstantin Shaposhnikov commented on PARQUET-291:
-------------------------------------------------

It also looks like Impala got it almost right ;)

In the file from the parquet-compatibility project (
https://github.com/Parquet/parquet-compatibility/blob/master/parquet-testdata/impala/1.1.1-NONE/nation.impala.parquet),
*ColumnChunk.file_offset + 4* points to the serialized ColumnMetaData. I 
assume the +4 accounts for the PAR1 magic at the beginning of the file.
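
To make the +4 adjustment concrete, here is a minimal sketch. It assumes that the file_offset written by Impala is measured from the end of the leading magic rather than from byte 0; that is only my reading of the file above and is not confirmed anywhere. The class and method names are hypothetical and are not part of Impala or parquet-mr.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MagicOffsetSketch {
    // Every Parquet file starts with the 4-byte magic "PAR1".
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the buffer begins with the "PAR1" magic. */
    static boolean startsWithMagic(byte[] file) {
        return file.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOfRange(file, 0, MAGIC.length), MAGIC);
    }

    /**
     * Hypothetical helper: turns a file_offset that is ASSUMED to be relative
     * to the end of the magic into an absolute position in the file.
     */
    static long resolveMagicRelativeOffset(byte[] file, long fileOffset) {
        if (!startsWithMagic(file)) {
            throw new IllegalArgumentException("missing PAR1 magic at start of file");
        }
        long absolute = fileOffset + MAGIC.length; // the "+ 4" from the comment
        if (fileOffset < 0 || absolute > file.length) {
            throw new IllegalArgumentException("offset outside of file: " + fileOffset);
        }
        return absolute;
    }
}
```

With this reading, a stored file_offset of 2 in a file that starts with PAR1 resolves to absolute position 6.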

I guess that neither Impala nor parquet-mr ever reads ColumnChunk.file_offset 
or uses the ColumnMetaData written after the ColumnChunk data; both always rely 
on the optional ColumnMetaData from FileMetaData.
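
For context, a reader that relies only on FileMetaData never needs file_offset to get started: the footer is found from the end of the file, which ends with the serialized FileMetaData, a 4-byte little-endian footer length, and the PAR1 magic. The sketch below only locates the footer bytes in an in-memory buffer; it does not decode the Thrift structure, and the class and method names are hypothetical, not taken from parquet-mr.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class FooterLocatorSketch {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns {footerStart, footerLength} for the serialized FileMetaData. */
    static long[] locateFooter(byte[] file) {
        int tail = 4 + MAGIC.length; // footer length field + trailing magic
        if (file.length < MAGIC.length + tail) {
            throw new IllegalArgumentException("file too short to be Parquet");
        }
        // The file must end with the magic.
        for (int i = 0; i < MAGIC.length; i++) {
            if (file[file.length - MAGIC.length + i] != MAGIC[i]) {
                throw new IllegalArgumentException("missing PAR1 magic at end of file");
            }
        }
        // The footer length is a 4-byte little-endian int just before the magic.
        int footerLength = ByteBuffer.wrap(file, file.length - tail, 4)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getInt();
        long footerStart = (long) file.length - tail - footerLength;
        if (footerLength < 0 || footerStart < MAGIC.length) {
            throw new IllegalArgumentException("corrupt footer length: " + footerLength);
        }
        return new long[] {footerStart, footerLength};
    }
}
```

The bytes in that range would then be handed to a Thrift compact-protocol decoder to obtain FileMetaData and, through it, each ColumnChunk and its optional ColumnMetaData.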

It might be a good idea to update the documentation.

> Difference between parquet-mr implementation and parquet-format documentation
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-291
>                 URL: https://issues.apache.org/jira/browse/PARQUET-291
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>    Affects Versions: 1.6.1
>            Reporter: Konstantin Shaposhnikov
>
> Documentation at 
> https://github.com/apache/parquet-format/blob/master/src/thrift/parquet.thrift
> {noformat}
> struct ColumnChunk {
>   /** File where column data is stored.  If not set, assumed to be same file as
>     * metadata.  This path is relative to the current file.
>     **/
>   1: optional string file_path
>   /** Byte offset in file_path to the ColumnMetaData **/
>   2: required i64 file_offset
> ...
> {noformat}
> and https://github.com/apache/parquet-format
> {noformat}
> 4-byte magic number "PAR1"
> <Column 1 Chunk 1 + Column Metadata>
> <Column 2 Chunk 1 + Column Metadata>
> ...
> {noformat}
> suggests that ColumnChunk data should be followed by ColumnChunkMetaData.
> However, it looks like parquet-mr doesn't write ColumnMetaData after the 
> column data at all and populates ColumnChunk.file_offset with the offset of 
> the first data page.
> From *ParquetMetadataConverter.java:153*:
> {code}
>     for (ColumnChunkMetaData columnMetaData : columns) {
>       ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
>       columnChunk.file_path = block.getPath(); // they are in the same file for now
> {code}
>  Is it a bug in parquet-mr or in the documentation?


