alkis commented on PR #250: URL: https://github.com/apache/parquet-format/pull/250#issuecomment-2139855539
> @alkis I think I might have misphrased or misdesigned something here. Where do you see the second fetch as a requirement? I imagine performant readers will still read a significant portion of the tail of the file to avoid the two fetches.

Apologies, that was my misunderstanding. I conflated this with #242, which has offsets for metadata elsewhere. Now that I understand this better, it seems the main trick we are employing is converting `list<ExpensiveToDecodePerColumnStruct>` to something like `list<binary>` (with optional compression) to make deserialization lazy.

1. How is this going to affect workloads that read all the columns (and thus require all the metadata)? Are they getting any benefit?
2. The effort to fix readers is not going to be trivial. Perhaps it would be similar if we revamped the schema substantially. Why should we stick to Thrift if we are going through the effort to refactor readers?
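To make the `list<binary>` trick concrete, here is a minimal sketch of lazy metadata decoding. This is not Parquet's actual format or API: the `LazyColumnMetadata` class is hypothetical, and `json` stands in for Thrift serialization purely for illustration. The point is that each element stays an opaque blob until a reader actually needs that column.

```python
import json  # stand-in for Thrift deserialization in this sketch


class LazyColumnMetadata:
    """Hypothetical wrapper: holds a serialized blob, decodes on first access."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self._decoded = None

    def get(self) -> dict:
        # Decode lazily and cache; untouched columns never pay the cost.
        if self._decoded is None:
            self._decoded = json.loads(self._blob)
        return self._decoded


# Eager layout (list<Struct>): every element is decoded at file-open time.
# Lazy layout (list<binary>): each element is decoded only when its column is read.
blobs = [
    json.dumps({"name": f"col{i}", "num_values": i * 10}).encode()
    for i in range(3)
]
metadata = [LazyColumnMetadata(b) for b in blobs]

# A projection touching one column decodes one blob instead of all three.
print(metadata[1].get()["name"])        # decodes only column 1
print(metadata[0]._decoded is None)     # columns 0 and 2 remain raw bytes
```

This also illustrates question 1 above: a full-scan workload ends up calling `get()` on every element, so it pays the same total decode cost as the eager layout.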
