emkornfield opened a new pull request, #250:
URL: https://github.com/apache/parquet-format/pull/250

   As a point of discussion, this is a slightly different way of showing how 
column metadata could be decoupled from FileMetadata than 
https://github.com/apache/parquet-format/pull/242
   
   In particular this takes a slightly different approach:
   1.  It introduces a new random access encoding for Parquet to store 
serialized data instead of relying on a one-off index scheme based on the Thrift 
structure.  Taking this approach allows flexibility for implementations to 
further balance size vs compute trade-offs and can potentially make use of any 
further encoding improvements in the future.  Two downsides of this approach are 
that it requires a little more work up-front and has slightly more overhead than 
doing this directly as Thrift structures.
   2. It places the serialized data page completely outside of the Thrift 
metadata and instead provides an offset within the footer. This is mostly a 
micro-optimization (likely not critical) to allow Parquet implementors to avoid 
unnecessary copies of string data when their Thrift library cannot avoid such 
copies itself.  There is no reason that the pages could not be inlined as a 
"binary" field in FileMetadata as is done in 
https://github.com/apache/parquet-format/pull/242
   3. Moves a few other fields out of FileMetadata into a metadata page and 
raises discussion points on others.
   4. Re-uses existing Thrift objects in an attempt to make the transition 
easier for implementors.
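
   To make points 1 and 2 concrete, here is a minimal sketch of one possible 
random access layout. It is not the encoding proposed in this PR: the layout 
(entry count, offset table, concatenated payload), the function names 
`encode_page` and `read_entry`, and the 8-byte stand-in footer are all 
assumptions made for illustration. It shows how a reader could fetch the 
metadata for a single column through an offset stored in the footer, without 
decoding the other entries and without copying the bytes.

```python
import struct


def encode_page(blobs):
    """Pack byte strings into one page: count, offset table, then payload.

    Hypothetical layout, little-endian uint32 throughout:
      [count][offset_0 .. offset_count][blob_0 blob_1 ...]
    """
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack("<I", len(blobs))
    table = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + table + b"".join(blobs)


def read_entry(page, index):
    """Return entry `index` as a memoryview, reading only two offsets."""
    (count,) = struct.unpack_from("<I", page, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    start, end = struct.unpack_from("<2I", page, 4 + 4 * index)
    payload_base = 4 + 4 * (count + 1)
    return memoryview(page)[payload_base + start:payload_base + end]


# Stand-ins for serialized per-column metadata (real entries would be Thrift).
columns = [b"meta-for-col-a", b"", b"meta-for-col-c"]
page = encode_page(columns)

# The page sits outside the footer; the footer only records where it is.
# Here the "footer" is just (page_offset, page_length), not real Thrift.
data = b"row-group-bytes"
footer = struct.pack("<2I", len(data), len(page))
file_bytes = data + page + footer

page_offset, page_length = struct.unpack_from("<2I", file_bytes, len(file_bytes) - 8)
page_view = memoryview(file_bytes)[page_offset:page_offset + page_length]
third = read_entry(page_view, 2)
```

   The offset table costs four bytes per entry, which is the kind of extra 
overhead mentioned in point 1; in exchange, looking up one column is 
independent of how many columns the file has.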
   
   
   Things it does not do:
   1.  Enumerate all fields that should be deprecated. 
https://github.com/apache/parquet-format/pull/242 is a good start, and we can 
consolidate on a final list once a general approach is chosen.
   2. Incorporate changes in https://github.com/apache/parquet-format/pull/248. 
These also likely make sense but can be incorporated into any final proposal.
   3. Micro-optimizations to separate scan use cases from filter evaluation 
use cases (the ColumnChunk structure could potentially be split further to 
give finer-grained access to elements that are only needed in one case or 
the other).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

