corwinjoy commented on issue #39676:
URL: https://github.com/apache/arrow/issues/39676#issuecomment-1901432629

   @emkornfield wrote:
   > @corwinjoy I think we should likely address a few issues here before 
proceeding to an implementation:
   > 
   > 1. Do you have a flame graph or other granular statistics showing where the 
parsing is spending time? I'd imagine a fair bit of it might be in copying 
unneeded string data, but having data would help identify the solution space for 
this. (Again, it feels like maintaining a fork of parquet.thrift that 
removes all statistics fields, and using generated code from that, might help 
improve this if the majority of time is spent copying that data; less so if 
the time is spent allocating lists or actually parsing.)
   
   See above.
   
   > 2. I think the second part of this, if I understand correctly, is an API 
that makes sense for communicating that we want to avoid any metadata that 
doesn't help with reading data (i.e. we don't desire any sort of statistics 
that could help with pruning). Could this be a reader property? The initial PR 
seems focused on the first row group, which may be more specific than what 
we would want.
   
   I'm not sure how much we can reduce this without changing the Parquet spec. 
My main argument is that reading all the row groups (and some of 
the other metadata) is simply unnecessary to retrieve the data.
   
   > 3. It sounds like some sort of pushdown sampling is desired if we can gain 
efficiencies by doing so in the Parquet library vs. one of the existing or 
proposed extension points. On this point, are the APIs proposed in 
[[C++][Parquet] support passing a RowRange to RecordBatchReader 
#38865](https://github.com/apache/arrow/issues/38865) sufficient?
   
   The interface proposed in #38865 is fine, but it suffers from the same 
problem as the benchmarks presented here: opening the file still has to read 
the full metadata before accessing any row group, and that can be very 
expensive. The kind of optimization presented here would provide the internals 
to avoid reading the full metadata while still being able to access row group 
data.
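
   As a rough illustration of why the open cost scales with the metadata: a 
plaintext Parquet file ends with a 4-byte little-endian footer length followed 
by the magic bytes `PAR1`, and the Thrift-encoded `FileMetaData` (which 
describes every row group) sits immediately before that. The sketch below 
reads only that length, using the Python standard library. The function name 
and the synthetic file are made up for illustration; this is not the Arrow 
reader's code.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of a plaintext (unencrypted) footer


def footer_metadata_length(f):
    """Return the byte length of the Thrift-encoded FileMetaData footer.

    `f` is a seekable binary file object. Hypothetical helper for illustration.
    """
    f.seek(-8, io.SEEK_END)  # last 8 bytes: 4-byte length + 4-byte magic
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])
    return length


# Synthetic stand-in for a Parquet file: magic, fake page data, a fake
# 25-byte footer, the footer length, and the trailing magic.
fake_footer = b"\x00" * 25
fake_file = io.BytesIO(
    PARQUET_MAGIC
    + b"data"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)
print(footer_metadata_length(fake_file))  # prints 25
```

   On a real file, the returned number is the size of the footer that has to 
be fetched and decoded on open, no matter how few row groups are 
subsequently read.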
   
   

