[ https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nandor Kollar updated PARQUET-1401:
-----------------------------------
Fix Version/s: encryption-feature-branch
> RowGroup offset and total compressed size fields
> ------------------------------------------------
>
> Key: PARQUET-1401
> URL: https://issues.apache.org/jira/browse/PARQUET-1401
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-cpp, parquet-format
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Major
> Labels: pull-request-available
> Fix For: encryption-feature-branch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter
> class, which calculate the offset and total compressed size of a RowGroup's
> data.
> The offset is calculated by extracting the ColumnMetaData of the first
> column and using its offset fields.
> The total compressed size is calculated by looping over all column chunks
> in the RowGroup and summing the size values from each chunk's
> ColumnMetaData.
> If one or more columns are hidden (encrypted with a key unavailable to the
> reader), these calculations can't be performed, because the column metadata
> is protected.
>
> However, these calculations don't really need the individual column values:
> the results pertain to the whole RowGroup, not to specific columns.
> Therefore, we will define two new optional fields in the RowGroup Thrift
> structure:
>
>     optional i64 file_offset
>     optional i64 total_compressed_size
>
> and calculate and set them when the file is written. Spark will then be able
> to query a file with hidden columns, provided the query itself doesn't need
> the hidden columns (that is, it works with a masked version of them, or
> reads only columns whose keys are available).
>
> These values can be set for encrypted files only, or for all files, which
> lets readers skip the loop. I've tested this, and it works fine in Spark
> writers and readers.
>
> I've also checked other references to ColumnMetaData fields in parquet-mr.
> There are none, so this is the only change we need in parquet.thrift to
> handle hidden columns.
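
The reader-side logic described in the issue can be sketched roughly as
follows. This is a minimal Python model, not parquet-mr code: the dataclasses
only mirror the Thrift field names mentioned above, a hidden column is
modelled (as an assumption) by a ColumnChunk whose meta_data is None, and the
helper names row_group_offset and row_group_compressed_size are hypothetical.
The rule that a chunk starts at its dictionary page when that page precedes
the data page is likewise an assumption about how the "offset fields" are
combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMetaData:
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    # Assumption: None models a hidden column, i.e. metadata encrypted
    # with a key the reader does not have.
    meta_data: Optional[ColumnMetaData]


@dataclass
class RowGroup:
    columns: List[ColumnChunk]
    # The two optional fields proposed in this issue.
    file_offset: Optional[int] = None
    total_compressed_size: Optional[int] = None


def chunk_offset(md: ColumnMetaData) -> int:
    """Start of a column chunk: the earlier of dictionary and data page."""
    offset = md.data_page_offset
    dict_offset = md.dictionary_page_offset
    if dict_offset is not None and dict_offset < offset:
        offset = dict_offset
    return offset


def row_group_offset(rg: RowGroup) -> int:
    """Prefer the RowGroup-level field; fall back to the first column."""
    if rg.file_offset is not None:
        return rg.file_offset
    first = rg.columns[0].meta_data
    if first is None:
        raise ValueError("first column is hidden and file_offset is not set")
    return chunk_offset(first)


def row_group_compressed_size(rg: RowGroup) -> int:
    """Prefer the RowGroup-level field; fall back to summing all chunks."""
    if rg.total_compressed_size is not None:
        return rg.total_compressed_size
    total = 0
    for chunk in rg.columns:
        if chunk.meta_data is None:
            raise ValueError("hidden column and total_compressed_size is not set")
        total += chunk.meta_data.total_compressed_size
    return total


# A row group with one hidden column is still usable when the writer
# has filled in the RowGroup-level fields.
hidden = RowGroup(
    columns=[ColumnChunk(None), ColumnChunk(ColumnMetaData(104, 50))],
    file_offset=4,
    total_compressed_size=150,
)
print(row_group_offset(hidden), row_group_compressed_size(hidden))
```

With the fallback kept in place, files written before the change remain
readable as long as none of their columns are hidden, and files that do set
the fields skip the per-column loop entirely.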