corwinjoy commented on issue #39676: URL: https://github.com/apache/arrow/issues/39676#issuecomment-1901432629
@emkornfield wrote:

> @corwinjoy I think we should likely address a few issues here before proceeding to an implementation:
>
> 1. Do you have a flame-graph or other granular statistics of where the parsing is spending time? I'd imagine a fair bit of it might be in copying unneeded string data, but having data would help identify the solution space for this. (Again, it feels like potentially maintaining a fork of parquet.thrift that removes all statistics fields, and using generated code from that, might help improve this if the majority of time is spent copying that data. Less so if the time is spent allocating lists/actually parsing.)

See above.

> 2. I think the second part of this, if IIUC, is an API that makes sense for communicating that we want to avoid any metadata that doesn't help with reading data (i.e. we don't desire any sort of statistics that could help with pruning). This could maybe be a reader property? It seems the initial PR focused on the first row group, which seems maybe more specific than something we would want?

I'm not sure how much we can reduce this without changing the Parquet spec. My main argument is that reading all the rowgroups (and some of the other metadata) is simply unnecessary to retrieve the data.

> 3. It sounds like some sort of pushdown sampling is desired if we can gain efficiencies by doing so in the parquet library vs one of the existing or proposed extension points. For this point, are the APIs proposed in [[C++][Parquet] support passing a RowRange to RecordBatchReader #38865](https://github.com/apache/arrow/issues/38865) sufficient?

The PR listed here is fine as an interface. It suffers from the same problem as the benchmarks presented here: opening the file still has to read the full metadata before accessing rowgroups, and that can be super-expensive. The kind of optimization presented here would provide internals to avoid reading the full metadata but still be able to access rowgroup data.
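To make "the full metadata" concrete: a Parquet file ends with the serialized FileMetaData footer, then a 4-byte little-endian footer length, then the `PAR1` magic. A rough way to see how much has to be deserialized on open is to read that length field. The sketch below uses only the Python standard library; the helper name `parquet_footer_length` is hypothetical and not part of Arrow or any Parquet library, and it assumes an unencrypted footer.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def parquet_footer_length(path):
    """Return the size in bytes of the serialized FileMetaData footer.

    Hypothetical helper for illustration only; assumes a plaintext
    (unencrypted) footer, i.e. a file that ends with the PAR1 magic.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        file_size = f.tell()
        # Smallest possible layout: leading magic + length field + trailing magic.
        if file_size < 12:
            raise ValueError("file too small to be a Parquet file")
        # The last 8 bytes are: 4-byte little-endian footer length, then magic.
        f.seek(file_size - 8)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("trailing PAR1 magic not found")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len
```

Calling `parquet_footer_length("some_file.parquet")` (the file name is a placeholder) on a file with many columns and rowgroups gives a sense of how large the block is that gets parsed before any rowgroup can be accessed.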
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
