emkornfield commented on PR #250:
URL: https://github.com/apache/parquet-format/pull/250#issuecomment-2140331049

   > Now that I understand this better it seems that the main trick we are 
employing is converting `list<ExpensiveToDecodePerColumnStruct>` to something 
like `list<binary>` (with optional compression) to make deserialization lazy.
   > 
   > 1. How is this going to affect workloads that read all the columns (and 
thus require all the metadata). Are they getting any benefit?
   
   No, it does not: readers that touch all of the columns still pay the full 
deserialization cost. How much regression that introduces is something we 
need to benchmark.
   
   > 2. The effort to fix readers is not going to be trivial. Perhaps it will 
be similar if we revamp the schema substantially. Why should we stick to thrift 
if we are going through the effort to refactor readers?
   
   I think we might have different priors on the effort involved and the 
scope of changes for readers.  I was hoping that with these changes existing 
readers could update to use it, without the benefit of laziness, in 
O(hours) - O(days) (e.g. if they happen to be exposing thrift structures as 
a public API in their system, like DuckDB).  If systems already have 
abstractions in place for metadata, making use of the laziness should also 
be pretty quick.  I see changing to a different encoding format as more 
effort, and multiplied across the number of custom parquet readers that 
makes ecosystem penetration less likely, or at least slower.  I don't want 
to rule out moving to flatbuffers in the future (this is yet another reason 
for the feature bitmap).
   
   CC @JFinis, I think your perspective might be valuable here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

