[
https://issues.apache.org/jira/browse/PARQUET-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525229#comment-17525229
]
Timothy Miller commented on PARQUET-2139:
-----------------------------------------
Of course, I'll be embarrassed if this turns out to just be a bug in my own
thrift parser, but everything seems to line up so far.
> Bogus file offset for ColumnMetaData written to ColumnChunk metadata of
> single parquet files
> --------------------------------------------------------------------------------------------
>
> Key: PARQUET-2139
> URL: https://issues.apache.org/jira/browse/PARQUET-2139
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: Timothy Miller
> Priority: Major
>
> In an effort to understand the parquet format better, I've so far written my
> own Thrift parser, and upon examining the output, I noticed something
> peculiar.
> To begin with, check out the definition for ColumnChunk here:
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift]
> You'll notice that if there's an element 2 in the struct, this is supposed to
> be a file offset to where a redundant copy of the ColumnMetaData is stored.
> Next, have a look at the file called "modified.parquet" attached to
> https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata
> at the end of the file, I get this:
> Struct(FileMetaData):
>   1: i32(version) = I32(1)
>   2: List(SchemaElement schema):
>     ...
>   3: i64(num_rows) = I64(1)
>   4: List(RowGroup row_groups):
>     1: Struct(RowGroup row_groups):
>       1: List(ColumnChunk columns):
>         1: Struct(ColumnChunk columns):
>           2: i64(file_offset) = I64(4)
>           3: Struct(ColumnMetaData meta_data):
>             1: Type(type) = I32(6) = BYTE_ARRAY
>             2: List(Encoding encodings):
>               1: Encoding(encodings) = I32(0) = PLAIN
>               2: Encoding(encodings) = I32(3) = RLE
>             3: List(string path_in_schema):
>               1: string(path_in_schema) = Binary("destination_addresses")
>               2: string(path_in_schema) = Binary("array")
>               3: string(path_in_schema) = Binary("element")
>             4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
>             5: i64(num_values) = I64(6)
>             6: i64(total_uncompressed_size) = I64(197)
>             7: i64(total_compressed_size) = I64(197)
>             9: i64(data_page_offset) = I64(4)
> As you can see, element 2 of the ColumnChunk indicates that there is another
> copy of the ColumnMetaData at offset 4 of the file. But then we see that
> element 9 of the ColumnMetaData shown above indicates that the data page
> offset is ALSO 4, where we should find a Thrift encoding of a PageHeader
> structure. Obviously, both structures can't be in the same place, and in fact
> a PageHeader is what is located there.
> Based on what I'm seeing here, I believe that element 2 of ColumnChunk should
> be omitted entirely in this scenario, so that it does not falsely indicate
> a second copy of the ColumnMetaData at a location in the file where
> something else (here, a PageHeader) is actually stored.
> It may take me a while to locate the offending code, but I thought I'd go
> ahead and point this out before I set off to investigate.
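For anyone who wants to look for the same symptom in other files, the check
described above can be sketched roughly as follows. This is only an
illustration, not parquet-mr code: it assumes the footer has already been
decoded into plain Python dicts keyed by the field names from parquet.thrift,
and the function name find_offset_collisions is hypothetical.

```python
from typing import Iterator, Tuple


def find_offset_collisions(file_metadata: dict) -> Iterator[Tuple[int, int, int]]:
    """Yield (row_group_index, column_index, offset) for every ColumnChunk
    whose file_offset equals its own data_page_offset.

    Assumption: file_metadata is a decoded FileMetaData represented as nested
    dicts and lists, using the field names from parquet.thrift.
    """
    for rg_idx, row_group in enumerate(file_metadata.get("row_groups", [])):
        for col_idx, chunk in enumerate(row_group.get("columns", [])):
            file_offset = chunk.get("file_offset")
            meta = chunk.get("meta_data") or {}
            data_page_offset = meta.get("data_page_offset")
            # A missing file_offset (element 2 omitted) is never reported.
            if file_offset is not None and file_offset == data_page_offset:
                yield rg_idx, col_idx, file_offset


# Hand-built stand-in for the relevant part of the dump quoted above.
footer = {
    "version": 1,
    "num_rows": 1,
    "row_groups": [
        {
            "columns": [
                {
                    "file_offset": 4,
                    "meta_data": {
                        "path_in_schema": ["destination_addresses", "array", "element"],
                        "data_page_offset": 4,
                    },
                }
            ]
        }
    ],
}

collisions = list(find_offset_collisions(footer))
for rg_idx, col_idx, offset in collisions:
    print(f"row group {rg_idx}, column {col_idx}: "
          f"file_offset and data_page_offset are both {offset}")
```

A hit only says that the two fields point at the same byte; deciding which
structure is really stored there still requires decoding the bytes at that
offset.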
--
This message was sent by Atlassian Jira
(v8.20.7#820007)