emkornfield commented on issue #39676: URL: https://github.com/apache/arrow/issues/39676#issuecomment-1900010135
@corwinjoy I think we should likely address a few issues here before proceeding to an implementation:

1. Do you have a flame graph or other granular statistics showing where the parsing is spending time? I'd imagine a fair bit of it might be in copying unneeded string data, but having data would help narrow the solution space. (Again, it feels like maintaining a fork of parquet.thrift that removes all statistics fields, and using the code generated from that, might help if the majority of time is spent copying that data. Less so if the time is spent allocating lists or actually parsing.)
2. IIUC, the second part of this is what API makes sense for communicating that we want to avoid any metadata that doesn't help with reading data (i.e. we don't desire any sort of statistics that could help with pruning). Could this maybe be a reader property? It seems the initial PR focused on the first row group, which seems maybe more specific than something we would want?
3. It sounds like some sort of pushdown sampling is desired, if we can gain efficiencies by doing so in the parquet library rather than in one of the existing or proposed extension points. For this point, are the APIs proposed in https://github.com/apache/arrow/issues/38865 sufficient?
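As a cheap first data point for the profiling question in item 1, the size of the serialized footer can be read without any Parquet library. This is only a rough sketch and not a substitute for a flame graph: it relies on the Parquet file layout, where a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The helper name `parquet_footer_length` is hypothetical, and files with an encrypted footer (which end in `PARE`) are rejected rather than handled.

```python
import os
import struct


def parquet_footer_length(path):
    """Return the size in bytes of the Thrift-serialized FileMetaData footer.

    Hypothetical helper for illustration; assumes a plaintext footer,
    i.e. a file whose last four bytes are the magic b"PAR1".
    """
    with open(path, "rb") as f:
        # The last 8 bytes are: 4-byte little-endian footer length + magic.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if len(tail) != 8 or tail[4:] != b"PAR1":
        raise ValueError("file does not end with the PAR1 magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Comparing this number across files with and without statistics written would give a hint of how much of the footer is statistics data, though it says nothing about where the parse time actually goes.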
