Timothy Miller created PARQUET-2139:
---------------------------------------

             Summary: Bogus file offset for ColumnMetaData written to 
ColumnChunk metadata of single parquet files
                 Key: PARQUET-2139
                 URL: https://issues.apache.org/jira/browse/PARQUET-2139
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.12.2
            Reporter: Timothy Miller


In an effort to understand the parquet format better, I've so far written my 
own Thrift parser, and upon examining the output, I noticed something peculiar.

To begin with, check out the definition for ColumnChunk here: 
[https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift]

You'll notice that if there's an element 2 in the struct, this is supposed to 
be a file offset to where a redundant copy of the ColumnMetaData is stored.

Next, have a look at the file called "modified.parquet" attached to 
https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata at 
the end of the file, I get this:

Struct(FileMetaData):
     1: i32(version) = I32(1)
     2: List(SchemaElement schema):
          ...
     3: i64(num_rows) = I64(1)
     4: List(RowGroup row_groups):
        1: Struct(RowGroup row_groups):
           1: List(ColumnChunk columns):
              1: Struct(ColumnChunk columns):
                 2: i64(file_offset) = I64(4)
                 3: Struct(ColumnMetaData meta_data):
                    1: Type(type) = I32(6) = BYTE_ARRAY
                    2: List(Encoding encodings):
                       1: Encoding(encodings) = I32(0) = PLAIN
                       2: Encoding(encodings) = I32(3) = RLE
                    3: List(string path_in_schema):
                       1: string(path_in_schema) = Binary("destination_addresses")
                       2: string(path_in_schema) = Binary("array")
                       3: string(path_in_schema) = Binary("element")
                    4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
                    5: i64(num_values) = I64(6)
                    6: i64(total_uncompressed_size) = I64(197)
                    7: i64(total_compressed_size) = I64(197)
                    9: i64(data_page_offset) = I64(4)
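
For anyone who wants to reproduce the dump above, the footer can be located 
without a full Parquet reader. The sketch below relies only on the documented 
file layout: the file starts and ends with the magic bytes PAR1, and the 4 
bytes before the trailing magic hold the footer length as a little-endian 
integer. The function name footer_range is my own invention, and the sketch 
does not handle files with encrypted footers.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def footer_range(data: bytes) -> Tuple[int, int]:
    """Return (start, end) byte offsets of the Thrift-encoded FileMetaData.

    Hypothetical helper for illustration; not part of parquet-mr.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a plain (unencrypted-footer) Parquet file")
    # The 4 bytes before the trailing magic are the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length is larger than the file allows")
    return start, end
```

The bytes in data[start:end] are what a Thrift compact-protocol parser would 
then decode into the FileMetaData structure.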

As you can see, element 2 of the ColumnChunk indicates that there is another 
copy of the ColumnMetaData at offset 4 of the file. But then we see that 
element 9 of the ColumnMetaData shown above indicates that the data page offset 
is ALSO 4, where we should find a Thrift encoding of a PageHeader structure. 
Obviously, both structures can't be in the same place, and in fact a PageHeader 
is what is located there.
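
A minimal sketch of this consistency check, assuming the column chunk has no 
dictionary page (as in the dump above), so that its pages occupy 
total_compressed_size bytes starting at data_page_offset. The two classes are 
illustrative stand-ins that carry only the fields needed here; they are not 
the parquet-mr or Thrift-generated types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    total_compressed_size: int  # element 7
    data_page_offset: int       # element 9


@dataclass
class ColumnChunk:
    file_offset: Optional[int]  # element 2; None models "omitted"
    meta_data: ColumnMetaData   # element 3


def file_offset_points_into_pages(chunk: ColumnChunk) -> bool:
    """True if file_offset lands inside the chunk's own page data.

    Assumption: no dictionary page, so the pages span
    [data_page_offset, data_page_offset + total_compressed_size).
    """
    if chunk.file_offset is None:
        return False
    start = chunk.meta_data.data_page_offset
    end = start + chunk.meta_data.total_compressed_size
    return start <= chunk.file_offset < end


# Values taken from the dump: file_offset = 4, data_page_offset = 4, size = 197.
suspect = ColumnChunk(file_offset=4,
                      meta_data=ColumnMetaData(total_compressed_size=197,
                                               data_page_offset=4))
print(file_offset_points_into_pages(suspect))
```

With the values from modified.parquet the check reports a conflict, since 
offset 4 is the first byte of the PageHeader.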

Based on what I'm seeing here, I believe that element 2 of ColumnChunk should 
be omitted entirely in this scenario, so that it does not falsely indicate a 
second copy of the ColumnMetaData at a location in the file where something 
else is actually stored.

It may take me a while to locate the offending code, but I thought I'd go ahead 
and point this out before I set off to investigate.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
