Gidon Gershinsky created PARQUET-1401:
-----------------------------------------

             Summary: RowGroup offset and total compressed size fields
                 Key: PARQUET-1401
                 URL: https://issues.apache.org/jira/browse/PARQUET-1401
             Project: Parquet
          Issue Type: Sub-task
          Components: parquet-cpp, parquet-format
            Reporter: Gidon Gershinsky
            Assignee: Gidon Gershinsky


Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter class, 
which calculate the offset and total compressed size of a RowGroup's data.
The offset is calculated by extracting the ColumnMetaData of the first column 
and using its offset fields.
The total compressed size is calculated by looping over all column chunks in 
the RowGroup and summing the size values from each chunk's ColumnMetaData.
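The per-column calculation described above can be sketched roughly as follows. This is a minimal illustration, not the actual parquet-mr code; the ColumnChunkMeta class and its field names are hypothetical stand-ins for the real ColumnMetaData Thrift structure.

```java
import java.util.List;

// Hypothetical, simplified stand-in for a column chunk's ColumnMetaData.
class ColumnChunkMeta {
    final long fileOffset;          // start of this chunk's data in the file
    final long totalCompressedSize; // compressed byte count of this chunk
    ColumnChunkMeta(long fileOffset, long totalCompressedSize) {
        this.fileOffset = fileOffset;
        this.totalCompressedSize = totalCompressedSize;
    }
}

class RowGroupStats {
    // RowGroup offset: taken from the first column chunk's metadata.
    static long rowGroupOffset(List<ColumnChunkMeta> chunks) {
        return chunks.get(0).fileOffset;
    }

    // RowGroup total compressed size: sum over all column chunks.
    static long rowGroupCompressedSize(List<ColumnChunkMeta> chunks) {
        long total = 0;
        for (ColumnChunkMeta c : chunks) {
            total += c.totalCompressedSize;
        }
        return total;
    }
}
```

Both helpers read the per-column metadata, which is exactly what becomes impossible when that metadata is encrypted with an unavailable key.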
If one or more columns are hidden (encrypted with a key unavailable to the 
reader), these calculations can't be performed, because the column metadata is 
protected. 
 
However, these calculations don't actually need the individual column values: 
the results pertain to the whole RowGroup, not to specific columns.
Therefore, we will define two new optional fields in the RowGroup Thrift 
structure:
 
_optional i64 file_offset_
_optional i64 total_compressed_size_
 
and calculate/set them upon file writing. Then Spark will be able to query a 
file with hidden columns (provided, of course, that the query itself doesn't 
need the hidden columns - it works with a masked version of them, or reads 
only columns with available keys).
 
These values can be set only for encrypted files (or for all files, to skip the 
loop upon reading). I've tested this; it works fine in Spark writers and readers.
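The proposed reader-side behavior can be sketched as: prefer the new RowGroup-level fields when they are set, and fall back to the per-column calculation otherwise. This is an illustrative sketch only; the class and field names below are hypothetical, not the actual parquet-mr API (null stands in for an unset Thrift optional field).

```java
import java.util.List;

// Hypothetical, simplified stand-in for RowGroup metadata with the two
// proposed optional fields.
class RowGroupMeta {
    Long fileOffset;            // proposed optional field (null if unset)
    Long totalCompressedSize;   // proposed optional field (null if unset)
    List<long[]> columns;       // each entry: {offset, compressedSize}

    long resolveOffset() {
        if (fileOffset != null) {
            // New field present: no need to touch (possibly hidden)
            // per-column metadata.
            return fileOffset;
        }
        // Fallback: offset of the first column chunk.
        return columns.get(0)[0];
    }

    long resolveCompressedSize() {
        if (totalCompressedSize != null) {
            return totalCompressedSize;
        }
        // Fallback: sum the sizes of all column chunks.
        long total = 0;
        for (long[] c : columns) {
            total += c[1];
        }
        return total;
    }
}
```

With the fields set by the writer, the fallback branches are never taken, so hidden column metadata is never read.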
 
I've also checked for other references to ColumnMetaData fields in parquet-mr. 
There are none; therefore, this is the only change we need in parquet.thrift to 
handle hidden columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
