I've commented, mostly on the MUST/MAY/SHALL section; it's just as important to call out the MUST NOT requirements.
I'm worried about the "SHOULD NOT substantially degrade performance of old readers" requirement. I'd put that in the MUST NOT group and define "substantially". If this slows down existing readers by anything more than a slightly larger end-of-file range to read before parsing, it won't be welcome, and so is less likely to be adopted.

I also added a security requirement. Maybe it should have its own section, primarily as one of due diligence, in which illegal/invalid values are discussed, such as different columns referring to overlapping parts of the file. It should add that clients are NOT required to check this where the check is expensive.

It would be good for all readers to add an option to validate the Thrift and flatbuf footers against each other to make sure they are consistent. That stops somebody trying to sneak something malicious deeper into the pipeline where they know that the front end only checks the Thrift values. A full scan of the whole footer for consistency of offsets, again, has to be an option.

What does matter is that if my code reads a file from an untrusted source which does have an inconsistent footer (columns declared as overlapping), this is not going to generate any exploit. You'd make full-footer validation part of the process for ingress of external sources, and from then on consider the file well-formed and consistent across all runtimes.

Steve

(why yes, I am getting more into cybersecurity :)

On Thu, 11 Sept 2025 at 07:43, Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:

> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
>
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
>
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
>
> I would prefer comments on the google doc to facilitate async discussion.
>
> Thank you,
>
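
PS: to make the "columns declared as overlapping" check concrete, here is a minimal sketch of what an optional offset-consistency scan might look like. Everything named here (ChunkRange, validate_chunk_ranges) is hypothetical and not part of any Parquet library; a real reader would fill in the ranges from whichever footer it parsed, and the same list could be built from both the Thrift and flatbuf footers and compared.

```python
from typing import List, NamedTuple


class ChunkRange(NamedTuple):
    """Hypothetical stand-in for one column chunk's byte range in the file."""
    column: str
    offset: int
    length: int


def validate_chunk_ranges(chunks: List[ChunkRange], file_size: int) -> List[str]:
    """Return a list of problems found; an empty list means the ranges are consistent."""
    problems = []
    in_bounds = []
    for c in chunks:
        # Reject negative values and ranges that run past the end of the file.
        if c.offset < 0 or c.length < 0 or c.offset + c.length > file_size:
            problems.append(f"{c.column}: range out of bounds")
        else:
            in_bounds.append(c)

    # Sort by start offset, then compare each range against the range
    # seen so far that reaches furthest into the file.
    in_bounds.sort(key=lambda c: (c.offset, c.length))
    furthest = None
    for c in in_bounds:
        if furthest is not None and c.offset < furthest.offset + furthest.length:
            problems.append(f"{furthest.column} overlaps {c.column}")
        if furthest is None or c.offset + c.length > furthest.offset + furthest.length:
            furthest = c
    return problems


if __name__ == "__main__":
    footer = [
        ChunkRange("a", 4, 100),
        ChunkRange("b", 104, 50),
        ChunkRange("c", 150, 20),  # starts inside b's range
    ]
    for problem in validate_chunk_ranges(footer, file_size=200):
        print(problem)
```

The scan is a sort plus one pass over the column chunks, so it is cheap relative to reading the data, but it still touches every entry in the footer, which is why I'd keep it as an opt-in step at ingress rather than something every reader must do on every open.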