[ https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandor Kollar updated PARQUET-1401:
-----------------------------------
    Labels: pull-request-available  (was: )

> RowGroup offset and total compressed size fields
> ------------------------------------------------
>
>                 Key: PARQUET-1401
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1401
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-cpp, parquet-format
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>              Labels: pull-request-available
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter class, 
> which calculate the offset and the total compressed size of a RowGroup's data.
> The offset is calculated by extracting the ColumnMetaData of the first 
> column and using its offset fields.
> The total compressed size is calculated by looping over all column chunks 
> in the RowGroup and summing up the size values from each chunk's 
> ColumnMetaData.
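>
> For illustration, a minimal sketch of these two calculations over the Thrift 
> metadata objects (getter names assumed from the Thrift-generated 
> org.apache.parquet.format API; not the exact parquet-mr code):
>
> {code:java}
> import org.apache.parquet.format.ColumnChunk;
> import org.apache.parquet.format.ColumnMetaData;
> import org.apache.parquet.format.RowGroup;
>
> class RowGroupStats {
>   // Offset of the RowGroup data: taken from the first column's ColumnMetaData
>   // (dictionary page offset if present and smaller, otherwise the data page offset).
>   static long rowGroupOffset(RowGroup rowGroup) {
>     ColumnMetaData first = rowGroup.getColumns().get(0).getMeta_data();
>     long offset = first.getData_page_offset();
>     if (first.isSetDictionary_page_offset()
>         && first.getDictionary_page_offset() < offset) {
>       offset = first.getDictionary_page_offset();
>     }
>     return offset;
>   }
>
>   // Total compressed size: sum of total_compressed_size over the
>   // ColumnMetaData of every column chunk in the RowGroup.
>   static long totalCompressedSize(RowGroup rowGroup) {
>     long total = 0;
>     for (ColumnChunk chunk : rowGroup.getColumns()) {
>       total += chunk.getMeta_data().getTotal_compressed_size();
>     }
>     return total;
>   }
> }
> {code}
>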
> If one or more columns are hidden (encrypted with a key unavailable to the 
> reader), these calculations can't be performed, because the column metadata 
> is protected. 
>  
> However, these calculations don't actually need the individual column values: 
> the results pertain to the whole RowGroup, not to specific columns. 
> Therefore, we will define two new optional fields in the RowGroup Thrift 
> structure:
>  
> _optional i64 file_offset_
> _optional i64 total_compressed_size_
>  
> and calculate/set them upon file writing. Then, Spark will be able to query a 
> file with hidden columns (of course, only if the query itself doesn't need 
> the hidden columns: it either works with a masked version of them or reads 
> only columns with available keys).
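>
> A rough sketch of the intended write/read flow (setter/getter names assumed 
> from what Thrift would generate for the new optional fields; the per-column 
> fallback reuses the RowGroupStats helper from the sketch above):
>
> {code:java}
> import org.apache.parquet.format.RowGroup;
>
> class RowGroupFields {
>   // Writer side: set the RowGroup-level fields from values the writer
>   // already knows, so readers never need the per-column metadata.
>   static void setOnWrite(RowGroup rowGroup, long fileOffset, long totalCompressedSize) {
>     rowGroup.setFile_offset(fileOffset);
>     rowGroup.setTotal_compressed_size(totalCompressedSize);
>   }
>
>   // Reader side: prefer the new fields; fall back to the per-column
>   // calculation only for older files that don't carry them.
>   static long offset(RowGroup rowGroup) {
>     return rowGroup.isSetFile_offset()
>         ? rowGroup.getFile_offset()
>         : RowGroupStats.rowGroupOffset(rowGroup);
>   }
>
>   static long totalCompressedSize(RowGroup rowGroup) {
>     return rowGroup.isSetTotal_compressed_size()
>         ? rowGroup.getTotal_compressed_size()
>         : RowGroupStats.totalCompressedSize(rowGroup);
>   }
> }
> {code}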
>  
> These values can be set only for encrypted files (or for all files, to skip 
> the loop upon reading). I've tested this; it works fine with Spark writers 
> and readers.
>  
> I've also checked for other references to ColumnMetaData fields in parquet-mr. 
> There are none, so this is the only change we need in parquet.thrift to 
> handle hidden columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
