[ https://issues.apache.org/jira/browse/PARQUET-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525229#comment-17525229 ]

Timothy Miller commented on PARQUET-2139:
-----------------------------------------

Of course, I'll be embarrassed if this turns out to just be a bug in my own
Thrift parser, but everything seems to line up so far.

> Bogus file offset for ColumnMetaData written to ColumnChunk metadata of single parquet files
> --------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2139
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Timothy Miller
>            Priority: Major
>
> In an effort to understand the parquet format better, I've so far written my
> own Thrift parser, and upon examining the output, I noticed something
> peculiar.
> To begin with, check out the definition for ColumnChunk here:
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift]
> You'll notice that if there's an element 2 in the struct, this is supposed to
> be a file offset to a redundant copy of the ColumnMetaData.
> Next, have a look at the file called "modified.parquet" attached to
> https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata
> at the end of the file, I get this:
> {{Struct(FileMetaData):}}
> {{ 1: i32(version) = I32(1)}}
> {{ 2: List(SchemaElement schema):}}
> {{ ...
>   3: i64(num_rows) = I64(1)
>   4: List(RowGroup row_groups):
>     1: Struct(RowGroup row_groups):
>       1: List(ColumnChunk columns):
>         1: Struct(ColumnChunk columns):
>           2: i64(file_offset) = I64(4)
>           3: Struct(ColumnMetaData meta_data):
>             1: Type(type) = I32(6) = BYTE_ARRAY
>             2: List(Encoding encodings):
>               1: Encoding(encodings) = I32(0) = PLAIN
>               2: Encoding(encodings) = I32(3) = RLE
>             3: List(string path_in_schema):
>               1: string(path_in_schema) = Binary("destination_addresses")
>               2: string(path_in_schema) = Binary("array")
>               3: string(path_in_schema) = Binary("element")
>             4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
>             5: i64(num_values) = I64(6)
>             6: i64(total_uncompressed_size) = I64(197)
>             7: i64(total_compressed_size) = I64(197)
>             9: i64(data_page_offset) = I64(4)
> }}
> As you can see, element 2 of the ColumnChunk indicates that there is another
> copy of the ColumnMetaData at offset 4 of the file. But then we see that
> element 9 of the ColumnMetaData shown above indicates that the data page
> offset is ALSO 4, where we should find a Thrift encoding of a PageHeader
> structure. Obviously, both structures can't be in the same place, and in fact
> a PageHeader is what is located there.
> Based on what I'm seeing here, I believe that element 2 of ColumnChunk should
> be omitted entirely in this scenario, so as not to falsely indicate that
> there is another copy of the ColumnMetaData at a location in the file where
> something else is in fact present.
> It may take me a while to locate the offending code, but I thought I'd go
> ahead and point this out before I set off to investigate.
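
The collision described in the issue can be checked mechanically once the footer has been parsed. What follows is a minimal Python sketch, not code from parquet-mr or any Parquet library: the `ColumnChunk` and `ColumnMetaData` classes are hypothetical stand-ins that hold only the fields quoted in the dump, and `file_offset_overlaps_pages` is an illustrative helper name. It assumes the chunk's pages occupy `total_compressed_size` bytes starting at the first page offset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    # Hypothetical stand-in: only the fields needed for the check.
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    # Hypothetical stand-in: file_offset is None when element 2 is absent.
    file_offset: Optional[int]
    meta_data: ColumnMetaData


def file_offset_overlaps_pages(chunk: ColumnChunk) -> bool:
    """Return True if file_offset falls inside the chunk's own page data.

    If it does, the offset cannot also point at a separate copy of the
    ColumnMetaData, which is the inconsistency reported in the issue.
    """
    if chunk.file_offset is None:
        return False
    md = chunk.meta_data
    # Assumption: pages start at the earlier of the dictionary page and
    # the first data page, and span total_compressed_size bytes.
    start = md.data_page_offset
    if md.dictionary_page_offset is not None:
        start = min(start, md.dictionary_page_offset)
    end = start + md.total_compressed_size
    return start <= chunk.file_offset < end


# Values taken from the metadata dump quoted in the issue.
chunk = ColumnChunk(
    file_offset=4,
    meta_data=ColumnMetaData(data_page_offset=4, total_compressed_size=197),
)
print(file_offset_overlaps_pages(chunk))  # True
```

With the values from the dump, `file_offset` and `data_page_offset` are both 4, so the helper reports an overlap. A chunk with no `file_offset` at all is reported as consistent, which matches the proposal to omit element 2 in this scenario.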