[ 
https://issues.apache.org/jira/browse/PARQUET-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525229#comment-17525229
 ] 

Timothy Miller commented on PARQUET-2139:
-----------------------------------------

Of course, I'll be embarrassed if this turns out to just be a bug in my own 
thrift parser, but everything seems to line up so far.

> Bogus file offset for ColumnMetaData written to ColumnChunk metadata of 
> single parquet files
> --------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2139
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Timothy Miller
>            Priority: Major
>
> In an effort to understand the parquet format better, I've so far written my 
> own Thrift parser, and upon examining the output, I noticed something 
> peculiar.
> To begin with, check out the definition for ColumnChunk here: 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift]
> You'll notice that if there's an element 2 in the struct, this is supposed to 
> be a file offset to where a redundant copy of the ColumnMetaData is stored.
> Next, have a look at the file called "modified.parquet" attached to 
> https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata 
> at the end of the file, I get this:
> Struct(FileMetaData):
>      1: i32(version) = I32(1)
>      2: List(SchemaElement schema):
>           ...
>      3: i64(num_rows) = I64(1)
>      4: List(RowGroup row_groups):
>         1: Struct(RowGroup row_groups):
>            1: List(ColumnChunk columns):
>               1: Struct(ColumnChunk columns):
>                  2: i64(file_offset) = I64(4)
>                  3: Struct(ColumnMetaData meta_data):
>                     1: Type(type) = I32(6) = BYTE_ARRAY
>                     2: List(Encoding encodings):
>                        1: Encoding(encodings) = I32(0) = PLAIN
>                        2: Encoding(encodings) = I32(3) = RLE
>                     3: List(string path_in_schema):
>                        1: string(path_in_schema) = Binary("destination_addresses")
>                        2: string(path_in_schema) = Binary("array")
>                        3: string(path_in_schema) = Binary("element")
>                     4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
>                     5: i64(num_values) = I64(6)
>                     6: i64(total_uncompressed_size) = I64(197)
>                     7: i64(total_compressed_size) = I64(197)
>                     9: i64(data_page_offset) = I64(4)
> As you can see, element 2 of the ColumnChunk indicates that there is another 
> copy of the ColumnMetaData at offset 4 of the file. But then we see that 
> element 9 of the ColumnMetaData shown above indicates that the data page 
> offset is ALSO 4, where we should find a Thrift encoding of a PageHeader 
> structure. Obviously, both structures can't be in the same place, and in fact 
> a PageHeader is what is located there.
> Based on what I'm seeing here, I believe that element 2 of ColumnChunk should 
> be omitted entirely in this scenario, so that it does not falsely indicate 
> another copy of the ColumnMetaData at a location in the file where something 
> else (here, a PageHeader) is actually stored.
> It may take me a while to locate the offending code, but I thought I'd go 
> ahead and point this out before I set off to investigate.
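
The collision described above can be checked mechanically once the footer
has been decoded. The following is only a minimal sketch in Python: the
ColumnMetaData and ColumnChunk classes and the colliding_file_offsets
helper are hypothetical stand-ins for whatever a Thrift parser produces,
not parquet-mr code, and they carry only the fields shown in the dump. It
flags a ColumnChunk whose element 2 (file_offset) equals element 9
(data_page_offset) of its ColumnMetaData.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMetaData:
    # Hypothetical stand-in holding only fields that appear in the dump.
    path_in_schema: List[str]   # element 3
    data_page_offset: int       # element 9


@dataclass
class ColumnChunk:
    # Hypothetical stand-in; file_offset is None when element 2 is absent.
    file_offset: Optional[int]  # element 2
    meta_data: ColumnMetaData   # element 3


def colliding_file_offsets(chunks: List[ColumnChunk]) -> List[str]:
    """Return one message per chunk whose file_offset equals its data_page_offset."""
    problems = []
    for chunk in chunks:
        if chunk.file_offset is None:
            continue  # element 2 omitted, so nothing is being claimed
        if chunk.file_offset == chunk.meta_data.data_page_offset:
            path = ".".join(chunk.meta_data.path_in_schema)
            problems.append(
                f"{path}: file_offset {chunk.file_offset} is also the data page offset"
            )
    return problems


# Values copied from the dump of modified.parquet quoted above.
dumped = ColumnChunk(
    file_offset=4,
    meta_data=ColumnMetaData(
        path_in_schema=["destination_addresses", "array", "element"],
        data_page_offset=4,
    ),
)
for message in colliding_file_offsets([dumped]):
    print(message)
```

An exact-equality test is the narrowest form of the check. It only reports
the case seen in this file and says nothing about offsets that land
elsewhere inside the column chunk's pages.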



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
