emkornfield commented on PR #250:
URL: https://github.com/apache/parquet-format/pull/250#issuecomment-2142613709

   > @emkornfield I have a prototype with flatbuffers that generates footers of 
similar size to the current thrift protocol. I left some notes in this PR with 
some of the optimizations I did to achieve this.
   
   I tried to respond to all the comments.
   
   > That said I feel the end state with flatbuffers is much better. The parse 
time (with verification) is already at 2.5ms/0.8ms for footer1/footer2 
respectively, without substantially deviating from the current thrift object 
model. There is a lot of room for improvements if we deviate.
   
   I tend to agree. I think we should pose this question on the ML thread to 
see what people's thoughts are on going directly to flatbuffers vs. something 
more iterative on thrift.  Deviation might be fine, but in terms of development 
effort I think we should aim for something that can be converted back and 
forth with the original footer without loss of data.  I think this would likely 
make it much easier for implementations to adopt flatbuffers and simply 
translate them back to Thrift structures as a bare minimum of support.
   
   > Should we join forces and iterate to make flatbuffers even better, while 
we collect footers from the fleet to compile a benchmark database to further 
validate the results? I feel there is a lot of room for shrinking Statistics 
further at which point we will have both smaller and >10x faster to parse 
footers.
   
   I'm happy to collaborate on the flatbuffers approach.  Do you want to open 
up a draft PR with your current state (maybe updated to include the responses 
here)?  I think the last major sticking point might be EncodingStatistics, and 
I wonder how much that affects flatbuffers, or if we can use the trick here to 
embed those within their own string (either as flatbuffers or kept as 
thrift).
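
To make the "embed those within their own string" idea concrete, here is a minimal sketch, not taken from the PR. It assumes the statistics payload is stored as an opaque byte string on a per-column entry and decoded only on demand, so parsing the outer footer never pays for it. `ColumnMeta`, `embed_stats`, and the field names are hypothetical, and JSON stands in for whatever serialization (flatbuffers or thrift) would actually be used.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for a per-column footer entry."""

    path: str
    num_values: int
    # Opaque, still-serialized statistics payload. The outer footer parser
    # copies these bytes without interpreting them.
    encoding_stats_blob: bytes
    _decoded: Optional[List[dict]] = field(default=None, repr=False, compare=False)

    def encoding_stats(self) -> List[dict]:
        # Decode lazily on first access and cache the result.
        if self._decoded is None:
            self._decoded = json.loads(self.encoding_stats_blob.decode("utf-8"))
        return self._decoded


def embed_stats(stats: List[dict]) -> bytes:
    # JSON is only a placeholder for the real wire format (assumption).
    return json.dumps(stats, sort_keys=True).encode("utf-8")


stats = [
    {"encoding": "PLAIN", "count": 3},
    {"encoding": "RLE_DICTIONARY", "count": 7},
]
col = ColumnMeta(path="a.b", num_values=10, encoding_stats_blob=embed_stats(stats))
```

Because the blob is carried through untouched, a reader that only understands the original thrift layout could copy it back without loss, which fits the back-and-forth requirement mentioned earlier.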
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

