[ 
https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594930#comment-16594930
 ] 

ASF GitHub Bot commented on PARQUET-1401:
-----------------------------------------

gszadovszky closed pull request #104: PARQUET-1401: optional RowGroup fields 
for handling hidden columns
URL: https://github.com/apache/parquet-format/pull/104
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 3a265796..03ec4b29 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -723,6 +723,13 @@ struct RowGroup {
    * The sorting columns can be a subset of all the columns.
    */
   4: optional list<SortingColumn> sorting_columns
+
+  /** Byte offset from beginning of file to first page (data or dictionary)
+   * in this row group **/
+  5: optional i64 file_offset
+
+  /** Total byte size of all compressed column data in this row group **/
+  6: optional i64 total_compressed_size
 }
 
 /** Empty struct to signal the order defined by the physical or logical type */


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> RowGroup offset and total compressed size fields
> ------------------------------------------------
>
>                 Key: PARQUET-1401
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1401
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-cpp, parquet-format
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>              Labels: pull-request-available
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter 
> class, which calculate the offset and total compressed size of a RowGroup's 
> data.
> The offset is calculated by extracting the ColumnMetaData of the first 
> column and using its offset fields.
> The total compressed size is calculated by looping over all column chunks 
> in the RowGroup and summing the size values from each chunk's 
> ColumnMetaData.
> If one or more columns are hidden (encrypted with a key unavailable to the 
> reader), these calculations can't be performed, because the column metadata 
> is protected. 
>  
> However, these calculations don't really need the individual column values. 
> The results pertain to the whole RowGroup, not to specific columns. 
> Therefore, we will define two new optional fields in the RowGroup Thrift 
> structure:
>  
> _optional i64 file_offset_
> _optional i64 total_compressed_size_
>  
> and calculate and set them when the file is written. Spark will then be able 
> to query a file with hidden columns (provided, of course, that the query 
> itself doesn't need the hidden columns: it either works with a masked version 
> of them or reads only columns with available keys).
>  
> These values can be set for encrypted files only, or for all files so that 
> readers can skip the loop. I've tested this, and it works fine in Spark 
> writers and readers.
>  
> I've also checked for other references to ColumnMetaData fields in 
> parquet-mr. There are none; therefore, this is the only change we need in 
> parquet.thrift to handle hidden columns.
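
The calculation described in the issue can be sketched roughly as follows. This is only an illustration, not the parquet-mr code: ChunkMeta and RowGroupMeta are hypothetical stand-ins for the Thrift-generated ColumnMetaData and RowGroup classes, a null chunk entry is an assumed way to model a hidden column, and taking the smaller of the dictionary and data page offsets is an assumption about what "using its offset fields" means. The reader prefers the new optional fields when they are set and falls back to the per-column calculation otherwise.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

/** Illustrative sketch; the nested types are hypothetical, not parquet-mr classes. */
final class RowGroupRangeSketch {

    /** Hypothetical stand-in for the ColumnMetaData fields used by the calculation. */
    static final class ChunkMeta {
        final long dataPageOffset;
        final OptionalLong dictionaryPageOffset;
        final long totalCompressedSize;

        ChunkMeta(long dataPageOffset, OptionalLong dictionaryPageOffset, long totalCompressedSize) {
            this.dataPageOffset = dataPageOffset;
            this.dictionaryPageOffset = dictionaryPageOffset;
            this.totalCompressedSize = totalCompressedSize;
        }
    }

    /** Hypothetical stand-in for RowGroup; a null chunk models a hidden column. */
    static final class RowGroupMeta {
        final List<ChunkMeta> chunks;
        final OptionalLong fileOffset;           // field 5 in the diff, optional
        final OptionalLong totalCompressedSize;  // field 6 in the diff, optional

        RowGroupMeta(List<ChunkMeta> chunks, OptionalLong fileOffset, OptionalLong totalCompressedSize) {
            this.chunks = chunks;
            this.fileOffset = fileOffset;
            this.totalCompressedSize = totalCompressedSize;
        }
    }

    /** Start of the row group: the new field if set, else derived from the first column. */
    static long startOffset(RowGroupMeta rg) {
        if (rg.fileOffset.isPresent()) {
            return rg.fileOffset.getAsLong();
        }
        ChunkMeta first = rg.chunks.get(0);
        if (first == null) {
            throw new IllegalStateException("first column is hidden and file_offset is not set");
        }
        long offset = first.dataPageOffset;
        if (first.dictionaryPageOffset.isPresent()) {
            // Assumption: the row group starts at whichever page comes first.
            offset = Math.min(offset, first.dictionaryPageOffset.getAsLong());
        }
        return offset;
    }

    /** Compressed size: the new field if set, else the sum over all column chunks. */
    static long compressedSize(RowGroupMeta rg) {
        if (rg.totalCompressedSize.isPresent()) {
            return rg.totalCompressedSize.getAsLong();
        }
        long total = 0;
        for (ChunkMeta chunk : rg.chunks) {
            if (chunk == null) {
                throw new IllegalStateException("a column is hidden and total_compressed_size is not set");
            }
            total += chunk.totalCompressedSize;
        }
        return total;
    }

    public static void main(String[] args) {
        ChunkMeta a = new ChunkMeta(120, OptionalLong.of(4), 300);
        ChunkMeta b = new ChunkMeta(304, OptionalLong.empty(), 200);
        // File written without the new fields: values come from the column metadata.
        RowGroupMeta legacy = new RowGroupMeta(Arrays.asList(a, b), OptionalLong.empty(), OptionalLong.empty());
        // File with a hidden second column: values come from the new RowGroup fields.
        RowGroupMeta hidden = new RowGroupMeta(Arrays.asList(a, null), OptionalLong.of(4), OptionalLong.of(500));
        System.out.println(startOffset(legacy) + " " + compressedSize(legacy));
        System.out.println(startOffset(hidden) + " " + compressedSize(hidden));
    }
}
```

In this sketch a row group with a hidden column still yields an offset and size as long as the writer set both optional fields; without them the fallback fails, which is the situation the issue describes.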



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
