[jira] [Commented] (PARQUET-2139) Bogus file offset for ColumnMetaData written to ColumnChunk metadata of single parquet files

Timothy Miller (Jira) Wed, 20 Apr 2022 13:24:09 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525264#comment-17525264
 ]


Timothy Miller commented on PARQUET-2139:
-----------------------------------------

I've noticed a few places that could be at fault here. I'm looking at 
1.13.0-SNAPSHOT, for reference.

The first is at line 513 of 
org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(), 
where we find:

{{      ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset}}

I'm pretty sure that this is the wrong thing to put here. The same thing 
(columnMetaData.getFirstDataPageOffset()) is used further down when 
constructing the ColumnMetaData, and since that works properly, this value is 
evidently the location of the PageHeader, not an extra copy of the 
ColumnMetaData. If no extra copy of the ColumnMetaData has been specified, then 
the default constructor should be called. In fact, it may be that we should 
ALWAYS call the default constructor here, since I cannot find any place in the 
code where a pointer to a redundant copy of the ColumnMetaData can even be 
specified.
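
To make the collision concrete, here is a minimal sketch of the check I have in mind. It does not use the parquet-mr classes; OffsetCheck and fileOffsetLooksBogus are hypothetical names, and the two arguments stand in for ColumnChunk.file_offset and ColumnMetaData.data_page_offset as read from a footer. It assumes that a nonzero file_offset equal to data_page_offset cannot be a valid pointer to a ColumnMetaData copy, because a PageHeader is what is stored there.

```java
class OffsetCheck {
    /**
     * Hypothetical helper: returns true when file_offset claims that a
     * ColumnMetaData copy lives at the same position as the first data page.
     * A zero file_offset is treated as "no redundant copy specified".
     */
    static boolean fileOffsetLooksBogus(long fileOffset, long dataPageOffset) {
        return fileOffset != 0 && fileOffset == dataPageOffset;
    }

    public static void main(String[] args) {
        // Values taken from the metadata dump in the issue description.
        System.out.println(fileOffsetLooksBogus(4L, 4L));
        // A chunk with no redundant copy would not be flagged.
        System.out.println(fileOffsetLooksBogus(0L, 4L));
    }
}
```

With the values from the dump in the issue description (both offsets are 4), the check flags the chunk.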

Secondly, I notice that at line 1179 of 
org.apache.parquet.format.ColumnChunk.write(), the FILE_OFFSET_FIELD_DESC field 
is written unconditionally to the Thrift encoder:

{{      oprot.writeFieldBegin(FILE_OFFSET_FIELD_DESC);}}
{{      oprot.writeI64(struct.file_offset);}}
{{      oprot.writeFieldEnd();}}

The writer should check the value and write the field only if it is nonzero.
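
A rough sketch of the conditional write, under stated assumptions: ConditionalFieldWrite and its write() method are hypothetical, and the returned list is only a stand-in for the calls made on the Thrift protocol object, not the real API. Note also that ColumnChunk.write() is generated code, and Thrift generators typically emit required fields unconditionally, so a real change may have to start in the IDL rather than in this class. The sketch does not cover that.

```java
import java.util.ArrayList;
import java.util.List;

class ConditionalFieldWrite {
    /**
     * Hypothetical stand-in for the writer: records which fields would be
     * emitted instead of calling a real Thrift protocol.
     */
    static List<String> write(long fileOffset, boolean hasMetaData) {
        List<String> written = new ArrayList<>();
        // Only emit file_offset when it actually points somewhere.
        if (fileOffset != 0) {
            written.add("file_offset=" + fileOffset);
        }
        // meta_data is emitted whenever it is set, independent of file_offset.
        if (hasMetaData) {
            written.add("meta_data");
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(write(0L, true));
        System.out.println(write(4L, true));
    }
}
```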

> Bogus file offset for ColumnMetaData written to ColumnChunk metadata of 
> single parquet files
> --------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2139
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Timothy Miller
>            Priority: Major
>
> In an effort to understand the parquet format better, I've so far written my 
> own Thrift parser, and upon examining the output, I noticed something 
> peculiar.
> To begin with, check out the definition for ColumnChunk here: 
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift]
> You'll notice that if there's an element 2 in the struct, this is supposed to 
> be a file offset to where a redundant copy of the ColumnMetaData is located.
> Next, have a look at the file called "modified.parquet" attached to 
> https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata 
> at the end of the file, I get this:
> {{Struct(FileMetaData):}}
> {{     1: i32(version) = I32(1)}}
> {{     2: List(SchemaElement schema):}}
> {{          ...
>      3: i64(num_rows) = I64(1)
>      4: List(RowGroup row_groups):
>         1: Struct(RowGroup row_groups):
>            1: List(ColumnChunk columns):
>               1: Struct(ColumnChunk columns):
>                  2: i64(file_offset) = I64(4)
>                  3: Struct(ColumnMetaData meta_data):
>                     1: Type(type) = I32(6) = BYTE_ARRAY
>                     2: List(Encoding encodings):
>                        1: Encoding(encodings) = I32(0) = PLAIN
>                        2: Encoding(encodings) = I32(3) = RLE
>                     3: List(string path_in_schema):
>                        1: string(path_in_schema) = Binary("destination_addresses")
>                        2: string(path_in_schema) = Binary("array")
>                        3: string(path_in_schema) = Binary("element")
>                     4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
>                     5: i64(num_values) = I64(6)
>                     6: i64(total_uncompressed_size) = I64(197)
>                     7: i64(total_compressed_size) = I64(197)
>                     9: i64(data_page_offset) = I64(4)
> }}
> As you can see, element 2 of the ColumnChunk indicates that there is another 
> copy of the ColumnMetaData at offset 4 of the file. But then we see that 
> element 9 of the ColumnMetaData shown above indicates that the data page 
> offset is ALSO 4, where we should find a Thrift encoding of a PageHeader 
> structure. Obviously, both structures can't be in the same place, and in fact 
> a PageHeader is what is located there.
> Based on what I'm seeing here, I believe that element 2 of ColumnChunk should 
> be omitted entirely in this scenario, so that it does not falsely indicate 
> another copy of the ColumnMetaData at a location in the file where something 
> else is actually present.
> It may take me a while to locate the offending code, but I thought I'd go 
> ahead and point this out before I set off to investigate.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
