Konstantin Shaposhnikov created PARQUET-291:
-----------------------------------------------

             Summary: Difference between parquet-mr implementation and parquet-format documentation
                 Key: PARQUET-291
                 URL: https://issues.apache.org/jira/browse/PARQUET-291
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format, parquet-mr
    Affects Versions: 1.6.1
            Reporter: Konstantin Shaposhnikov


Documentation at 
https://github.com/apache/parquet-format/blob/master/src/thrift/parquet.thrift

{noformat}
struct ColumnChunk {
  /** File where column data is stored.  If not set, assumed to be same file as
    * metadata.  This path is relative to the current file.
    **/
  1: optional string file_path

  /** Byte offset in file_path to the ColumnMetaData **/
  2: required i64 file_offset

...
{noformat}

and https://github.com/apache/parquet-format

{noformat}
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
{noformat}

suggests that each column chunk's data should be followed by its ColumnMetaData.

However, it looks like parquet-mr doesn't write ColumnMetaData after the column 
chunks at all, and instead populates ColumnChunk.file_offset with the offset of 
the first data page:

from *ParquetMetadataConverter.java:153*:
{code}
    for (ColumnChunkMetaData columnMetaData : columns) {
      ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
      columnChunk.file_path = block.getPath(); // they are in the same file for now

{code}
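
To make the two readings concrete, here is a minimal sketch of the offset each one would produce for a single column chunk. The class and method names (ColumnChunkOffsets, documentedFileOffset, observedFileOffset) are hypothetical and are not part of parquet-mr or parquet-format. The sketch assumes the documented layout places ColumnMetaData immediately after the chunk's pages, and that an optional dictionary page precedes the first data page.

```java
// Hypothetical model for illustration only; not the parquet-mr or parquet-format API.
final class ColumnChunkOffsets {
    /** Length in bytes of the leading "PAR1" magic number. */
    static final long MAGIC_LENGTH = 4;

    /**
     * file_offset as the parquet.thrift comment reads: the byte offset of the
     * ColumnMetaData, assumed here to be written right after all pages of the chunk.
     */
    static long documentedFileOffset(long chunkStart, long dictionaryPageBytes, long dataPageBytes) {
        return chunkStart + dictionaryPageBytes + dataPageBytes;
    }

    /**
     * file_offset as the quoted parquet-mr code populates it: the offset of the
     * first data page, which comes after the dictionary page if one is present.
     */
    static long observedFileOffset(long chunkStart, long dictionaryPageBytes) {
        return chunkStart + dictionaryPageBytes;
    }

    public static void main(String[] args) {
        long chunkStart = MAGIC_LENGTH;  // first chunk starts right after "PAR1"
        long dictionaryPageBytes = 0;    // made-up sizes, no dictionary page
        long dataPageBytes = 1000;

        System.out.println("documented file_offset: "
                + documentedFileOffset(chunkStart, dictionaryPageBytes, dataPageBytes));
        System.out.println("observed file_offset:   "
                + observedFileOffset(chunkStart, dictionaryPageBytes));
    }
}
```

With these made-up sizes the two values differ by the total size of the data pages, so a reader that follows the documentation would seek past the data instead of to its start.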

Is this a bug in parquet-mr or in the documentation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
