alkis commented on PR #250:
URL: https://github.com/apache/parquet-format/pull/250#issuecomment-2139855539

   > @alkis I think I might have misphrased or misdesigned something here. Where 
do you see the second fetch as a requirement? I imagine performant readers will 
still read a significant portion of the tail of the file to avoid the two 
fetches.
   
   Apologies, that was my misunderstanding. I conflated this with #242, which 
has offsets for metadata elsewhere.
   
   Now that I understand this better, it seems the main trick we are employing 
is converting `list<ExpensiveToDecodePerColumnStruct>` to something like 
`list<binary>` (with optional compression) to make deserialization lazy.
   1. How is this going to affect workloads that read all the columns (and thus 
require all the metadata)? Do they get any benefit?
   2. The effort to fix readers is not going to be trivial. Perhaps it would be 
similar if we revamped the schema substantially. Why should we stick with 
Thrift if we are going through the effort of refactoring readers anyway?
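
   For what it's worth, the trick being discussed can be sketched in a few 
lines. This is a minimal illustration, not Parquet's actual Thrift layout: the 
struct encoding, field names, and `decode_column_meta` helper below are all 
made up. The point is only that a footer modeled as `list<binary>` lets a 
reader pay the per-column decode cost on first access instead of up front.

```python
import struct
from functools import cached_property

def encode_column_meta(name: str, num_values: int, offset: int) -> bytes:
    # Hypothetical fixed-header encoding standing in for a Thrift struct.
    return struct.pack("<qq", num_values, offset) + name.encode("utf-8")

def decode_column_meta(raw: bytes) -> dict:
    # Stand-in for the expensive per-column deserialization step.
    num_values, offset = struct.unpack("<qq", raw[:16])
    return {"name": raw[16:].decode("utf-8"),
            "num_values": num_values,
            "offset": offset}

class LazyColumnMeta:
    """Holds the undecoded bytes; decodes at most once, on first access."""
    def __init__(self, raw: bytes):
        self._raw = raw

    @cached_property
    def meta(self) -> dict:
        return decode_column_meta(self._raw)

# Footer as list<binary>: a projection touching 1 of 1000 columns decodes
# only that one entry; an eager list<struct> footer would decode all 1000.
footer = [LazyColumnMeta(encode_column_meta(f"col{i}", 100 + i, 4096 * i))
          for i in range(1000)]
print(footer[7].meta["name"])  # decodes only column 7
```

   This also makes the concern in point 1 concrete: a reader that touches every 
column still ends up running the expensive decode for every entry, so the 
benefit there comes only from deferring (and possibly parallelizing) that work, 
not from avoiding it.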


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

