Two reasons:

1. All column-specific metadata is specified today in RowGroups. Settings such as the compression algorithm (which, I believe, could in theory be column-specific, but is usually file-wide) are repeated in every column of every row group.
2. No less important, and probably more so: in the future we'll likely add support for using different encryption keys for different row groups. This scenario comes up from time to time in discussions. One example is time series data, where a user has access to a certain time span but not to all of the data (e.g. to enable revoking user access at some point in time). Another is using row groups for other kinds of data grouping in large files, e.g. by geography; different keys would allow for corresponding access control.
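
To make the time series scenario concrete, here is a minimal sketch. It is not part of the Parquet format or of any existing API: `RowGroupInfo`, `key_id`, and `readable_row_groups` are hypothetical names used only for illustration, and the sketch assumes each row group records the identifier of the key that encrypts it.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class RowGroupInfo:
    """Hypothetical per-row-group record; not a real Parquet structure."""
    index: int
    key_id: str  # assumed: identifier of the key that encrypts this row group


def readable_row_groups(row_groups: List[RowGroupInfo], user_keys: Set[str]) -> List[int]:
    """Return the indices of the row groups whose key the user holds."""
    return [rg.index for rg in row_groups if rg.key_id in user_keys]


# Three row groups, each covering a different time span and a different key.
row_groups = [
    RowGroupInfo(index=0, key_id="key-period-1"),
    RowGroupInfo(index=1, key_id="key-period-2"),
    RowGroupInfo(index=2, key_id="key-period-3"),
]

# The user was granted the first two periods only.
user_keys = {"key-period-1", "key-period-2"}

print(readable_row_groups(row_groups, user_keys))  # prints [0, 1]
```

Under this assumption, revoking access from some point in time onward amounts to not handing out the keys for later row groups; the earlier row groups stay readable.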
[ Full content available at: https://github.com/apache/parquet-format/pull/110 ] This message was relayed via gitbox.apache.org for [email protected]
