emkornfield commented on PR #250: URL: https://github.com/apache/parquet-format/pull/250#issuecomment-2140331049
> Now that I understand this better, it seems that the main trick we are employing is converting `list<ExpensiveToDecodePerColumnStruct>` to something like `list<binary>` (with optional compression) to make deserialization lazy.
>
> 1. How is this going to affect workloads that read all the columns (and thus require all the metadata)? Are they getting any benefit?

No, it does not. How much regression is present is something we need to benchmark.

> 2. The effort to fix readers is not going to be trivial. Perhaps it will be similar if we revamp the schema substantially. Why should we stick to thrift if we are going through the effort to refactor readers?

I think we might have different priors on the effort involved and the scope of changes for readers. I was hoping that with these changes, existing readers could update to use it, without the benefit of laziness, in O(hours) to O(days) (e.g. if they happen to be exposing thrift structures as a public API in their system, like DuckDB). If systems already have abstractions in place for metadata, making use of the laziness should also be pretty quick. I see changing to a different encoding format as more effort, and, multiplied across the number of custom parquet readers, it makes ecosystem penetration less likely, or at least slower. I don't want to rule out moving to flatbuffers in the future (this is yet another reason for the feature bitmap).

CC @JFinis, I think your perspective might be valuable here.
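To make the `list<binary>` trick concrete, here is a minimal sketch of lazy per-column metadata decoding. All names are hypothetical, and JSON plus zlib stand in for the real per-column encoding (Thrift compact in Parquet) and its optional compression; the point is only that a reader pays the decode cost per column accessed, not up front for every column.

```python
import json
import zlib
from typing import Optional


class LazyColumnMetadata:
    """Holds one column's metadata as an opaque (optionally compressed)
    byte blob, deserializing it only on first access."""

    def __init__(self, blob: bytes, compressed: bool = False) -> None:
        self._blob = blob
        self._compressed = compressed
        self._decoded: Optional[dict] = None  # cache of the decoded struct

    def get(self) -> dict:
        if self._decoded is None:
            raw = zlib.decompress(self._blob) if self._compressed else self._blob
            # JSON is a stand-in for the real encoding (Thrift in Parquet).
            self._decoded = json.loads(raw)
        return self._decoded


def encode_columns(columns: list, compress: bool = True) -> list:
    """Writer side: emit list<binary> instead of list<struct>."""
    out = []
    for col in columns:
        raw = json.dumps(col).encode()
        out.append(zlib.compress(raw) if compress else raw)
    return out
```

A reader that projects only a subset of columns then decodes only those blobs; a full-table scan still decodes everything, which is why the all-columns regression above needs benchmarking.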
