emkornfield commented on PR #250:
URL: https://github.com/apache/parquet-format/pull/250#issuecomment-2142613709
> @emkornfield I have a prototype with flatbuffers that generates footers of similar size to the current thrift protocol. I left some notes in this PR with some of the optimizations I did to achieve this. I tried to respond to all the comments.
>
> That said I feel the end state with flatbuffers is much better. The parse time (with verification) is already at 2.5ms/0.8ms for footer1/footer2 respectively, without substantially deviating from the current thrift object model. There is a lot of room for improvements if we deviate.

I tend to agree. I think we should pose this question on the ML thread to see what people's thoughts are on going directly to flatbuffers vs something more iterative on thrift. Deviation might be fine, but in terms of development effort I think we should aim for something that can be converted to and from the original footer without loss of data. That would likely make it much easier for implementations to adopt flatbuffers, since as a bare minimum of support they could simply translate the flatbuffers footer back to Thrift structures.

> Should we join forces and iterate to make flatbuffers even better, while we collect footers from the fleet to compile a benchmark database to further validate the results? I feel there is a lot of room for shrinking Statistics further at which point we will have both smaller and >10x faster to parse footers.

I'm happy to collaborate on the flatbuffers approach. Do you want to open up a draft PR with your current state (maybe updated to include the responses here)? I think the last major sticking point might be EncodingStatistics, and I wonder how much that affects flatbuffers, or whether we can use the trick here to embed those within their own string (either as flatbuffers or kept as thrift).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
