I commented, mostly on the must/may/shall section: it's just as important to
call out the MUST NOT requirements.

I'm worried about the "should not substantially degrade performance of old
readers" requirement. I'd put that in the MUST NOT group and define
"substantially". If this slows down existing readers in any way other than a
slightly larger end-of-file range to read before parsing, it won't be
welcome, and so is less likely to be adopted.

I also added a security requirement; maybe it should have its own section,
primarily as a matter of due diligence, in which illegal/invalid values are
discussed, such as references from different columns to overlapping parts of
the file. It should add that clients are NOT required to check this where
the check is expensive.
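
As a rough illustration of the kind of check meant here, the following is a
minimal Python sketch, not anything taken from the proposal. It assumes the
footer has already been decoded into a list of (column, offset, length)
entries; the names `ChunkRange` and `find_overlaps` are hypothetical. Sorting
by offset keeps the cost at O(n log n) in the number of column chunks.

```python
from typing import NamedTuple


class ChunkRange(NamedTuple):
    """Hypothetical view of one column chunk as declared in a footer."""
    column: str
    offset: int  # first byte of the chunk within the file
    length: int  # number of bytes the chunk occupies


def find_overlaps(chunks, file_size):
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []
    in_bounds = []
    for c in chunks:
        # Reject ranges that fall outside the file before comparing them.
        if c.offset < 0 or c.length < 0 or c.offset + c.length > file_size:
            problems.append(f"{c.column}: range out of bounds")
        else:
            in_bounds.append(c)

    # After sorting by offset, a chunk overlaps an earlier one exactly when
    # it starts before the furthest end seen so far.
    in_bounds.sort(key=lambda c: c.offset)
    furthest = None
    for c in in_bounds:
        if furthest is not None and c.offset < furthest.offset + furthest.length:
            problems.append(f"{c.column} overlaps {furthest.column}")
        if furthest is None or c.offset + c.length > furthest.offset + furthest.length:
            furthest = c
    return problems
```

A reader that treats this as optional would only call it when validation is
switched on, which matches the "NOT required where expensive" point.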

It would be good for all readers to add an option to validate the thrift
and flatbuf footers to make sure they are consistent. That would stop
somebody trying to sneak something malicious deeper into the pipeline where
they know that the front end only checks the thrift values. A full scan of
the whole footer for consistency of offsets, again, has to be an option.
What does matter is that if my code reads a file from an untrusted source
which does have an inconsistent footer (columns declared as overlapping),
this is not going to generate any exploit. You'd make full-footer validation
part of the process for ingress of external sources, and from then on
consider the file well-formed and consistent across all runtimes.
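
A minimal sketch of such an optional cross-check at ingress, in Python, under
loud assumptions: both footers are taken to be already decoded into plain
dicts mapping (row group index, column path) to (offset, length). That
representation and the names `footer_mismatches` and `admit_external_file`
are hypothetical, chosen only for illustration.

```python
def footer_mismatches(thrift_view, flatbuf_view):
    """Compare two decoded footer views entry by entry.

    Each view is assumed to be a dict mapping
    (row_group_index, column_path) -> (offset, length).
    Returns a list of (key, thrift_value, flatbuf_value) for every key where
    the two views disagree, including keys present in only one of them.
    """
    mismatches = []
    for key in sorted(set(thrift_view) | set(flatbuf_view)):
        t = thrift_view.get(key)
        f = flatbuf_view.get(key)
        if t != f:
            mismatches.append((key, t, f))
    return mismatches


def admit_external_file(thrift_view, flatbuf_view, validate=True):
    """Ingress gate: with validation on, admit only consistent footers."""
    if not validate:
        # Validation is optional; callers reading trusted files can skip it.
        return True
    return not footer_mismatches(thrift_view, flatbuf_view)
```

Run once at the boundary, a check like this lets everything downstream treat
the file as consistent without paying for the comparison again.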

Steve

(why yes, I am getting more into cybersecurity :)




On Thu, 11 Sept 2025 at 07:43, Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
>
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
>
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
>
> I would prefer comments on the google doc to facilitate async discussion.
>
> Thank you,
>
