emkornfield commented on issue #39676:
URL: https://github.com/apache/arrow/issues/39676#issuecomment-1900010135

   @corwinjoy I think we should likely address a few issues here before 
proceeding to an implementation:
   
   1.  Do you have a flame graph or other granular statistics showing where 
parsing spends its time?  I'd imagine a fair bit of it goes to copying 
unneeded string data, but having data would help narrow the solution space. 
(Again, maintaining a fork of parquet.thrift that removes all statistics 
fields, and using the code generated from it, might help if the majority of 
the time is spent copying that data; less so if the time is spent allocating 
lists or actually parsing.)
   2. IIUC, the second part of this is an API for communicating that we want 
to skip any metadata that doesn't help with reading data (i.e. we don't want 
statistics whose purpose is pruning).  Could this be a reader property?  The 
initial PR seems to focus on the first row group, which seems more specific 
than what we would want.
   3. It sounds like some sort of pushdown sampling is desired, if we can gain 
efficiencies by doing it in the parquet library rather than in one of the 
existing or proposed extension points.  For this point, are the APIs proposed 
in https://github.com/apache/arrow/issues/38865 sufficient?


