> To be clear, I agree that we need to check that our various validation
> and integration suites pass properly. But once that is done and
> assuming all the metadata variations are properly tested, data
> variations should not pose any problem.
>
Unless I'm misunderstanding your proposal, that doesn't deal with the
data that has already been produced that may have been written in a way
that this change finds non-consumable but works today. By doing things
at the format level, there is no way for flatbuf to parse data that
doesn't comply.

> The write side is irrelevant here, since the concern is to protect
> reliably against invalid input (especially due to malicious intent).
>

Not really. If this had been enforced on the write side since day 1,
enforcing on the read side now would be a noop. If we started enforcing
this on the write side today across all languages then it would make it
more feasible to incorporate into the read side six months or a year
from now (as data ages out). I don't know about others but our use of
persisted Arrow flatbuf serialization is primarily focused on fairly
short shelf-life datasets (months more than years).

> Of course, we can hand-write all the NULL checks on the read side. My
> concern is not the one-time cost of doing so, but the long-term
> fragility of such a strategy

I agree with you in principle about using tools rather than humans to
minimize mistakes. On the flipside, we chose to use optional for the
same reason that flatbuf defaults to optional, protobuf2 recommended use
of optional over required and protobuf3 removed the ability to express
things as required [1].

> (every refactor or format addition is a
> threat to the robustness of the IPC reader).

Any format additions can be implemented however we want (required,
optional, etc) so I don't see that as related to the issue at hand.

> I don't think a potential
> long-standing history of security issues in Arrow would help adoption.

This is a strawman argument. I also think we should avoid having a
long-standing history of security issues.

[1] https://github.com/protocolbuffers/protobuf/issues/2497