Konstantin Shaposhnikov created PARQUET-291:
-----------------------------------------------
Summary: Difference between parquet-mr implementation and parquet-format documentation
Key: PARQUET-291
URL: https://issues.apache.org/jira/browse/PARQUET-291
Project: Parquet
Issue Type: Bug
Components: parquet-format, parquet-mr
Affects Versions: 1.6.1
Reporter: Konstantin Shaposhnikov
Documentation at
https://github.com/apache/parquet-format/blob/master/src/thrift/parquet.thrift
{noformat}
struct ColumnChunk {
  /** File where column data is stored. If not set, assumed to be same file as
    * metadata. This path is relative to the current file.
    **/
  1: optional string file_path
  /** Byte offset in file_path to the ColumnMetaData **/
  2: required i64 file_offset
  ...
{noformat}
and https://github.com/apache/parquet-format
{noformat}
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
{noformat}
suggests that each ColumnChunk's data should be followed by its ColumnMetaData.
However, it looks like parquet-mr doesn't write ColumnMetaData after the column
chunks at all, and instead populates ColumnChunk.file_offset with the offset of
the first data page. From *ParquetMetadataConverter.java:153*:
{code}
for (ColumnChunkMetaData columnMetaData : columns) {
  ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
  columnChunk.file_path = block.getPath(); // they are in the same file for now
{code}
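To make the two readings of file_offset concrete, here is a minimal Java sketch. It is not parquet-mr code: the class name, the method names, and the 1000-byte chunk size are hypothetical, and it assumes the chunk has no dictionary page, so the first data page starts where the chunk starts.

```java
public class ColumnChunkOffsetSketch {
    // Length of the leading magic number "PAR1" described in the format README.
    static final int MAGIC_LENGTH = 4;

    /**
     * Reading 1 (documentation): file_offset points at a ColumnMetaData struct
     * written immediately after the chunk's pages.
     */
    static long offsetPerDocumentation(long chunkStart, long chunkDataLength) {
        return chunkStart + chunkDataLength;
    }

    /**
     * Reading 2 (what the quoted parquet-mr code appears to do): file_offset is
     * simply the offset of the chunk's first data page.
     */
    static long offsetAsWrittenByParquetMr(long firstDataPageOffset) {
        return firstDataPageOffset;
    }

    public static void main(String[] args) {
        long chunkStart = MAGIC_LENGTH; // first chunk starts right after "PAR1"
        long chunkDataLength = 1000;    // hypothetical total size of the chunk's pages

        System.out.println("file_offset per documentation: "
                + offsetPerDocumentation(chunkStart, chunkDataLength));
        System.out.println("file_offset per parquet-mr:    "
                + offsetAsWrittenByParquetMr(chunkStart));
    }
}
```

Under these assumptions the two values differ by the full length of the chunk, so a reader that follows the documentation and seeks to file_offset expecting a ColumnMetaData struct would land on page data instead.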
Is it a bug in parquet-mr or in the documentation?