lidavidm commented on pull request #9725: URL: https://github.com/apache/arrow/pull/9725#issuecomment-803488875
To share the context for this refactor, Ben wrote up this doc: https://docs.google.com/document/d/1LzlDnnmKGCkD9RWGXyMQDHwf14Ad9K4ojn9AafkGFSg/edit?usp=sharing

> This is something we need to do; dask and other advanced parquet consumers need ridiculously sophisticated hooks for scanning (let alone writing). For example: whether to populate statistics (when reading into a single table with no filter, there is no point in converting statistics to expressions); whether they should be accumulated or cached (the cudf folks wanted to copy the unparsed metadata buffers to the GPU); conversion details (`dict_columns` might be decided interactively, when a string column is discovered to have few distinct values); I/O minutiae (stream block size/buffering/chunking/... might be adjusted after a scan starts taking too long); ...

Depending on what everyone thinks, I may revisit the implementation, but yes, let's try to present a convenient API for R/Python users, and have a nice feature to announce for 4.0.
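To make the "hooks" idea above more concrete, here is a minimal, purely hypothetical C++ sketch. None of these names exist in Arrow; it only illustrates how the late-bound decisions listed in the quote (statistics, raw-metadata access, per-column dictionary encoding, block size) could be expressed as callbacks:

```cpp
// Hypothetical sketch only -- not the actual Arrow API. Shows one way the
// per-scan hooks discussed above could look as plain std::function callbacks.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Placeholder for a file's unparsed metadata bytes (e.g. a Parquet footer).
struct MetadataBuffer {
  std::vector<uint8_t> bytes;
};

// Hypothetical per-scan options whose decisions are deferred until scan time.
struct ScanHooks {
  // Skip statistics->expression conversion when no filter is present.
  bool populate_statistics = true;

  // Called with each file's raw metadata buffer, so a consumer (e.g. cudf)
  // could copy it to the GPU instead of having it parsed on the CPU.
  std::function<void(const MetadataBuffer&)> on_raw_metadata;

  // Decide per column, as it is discovered, whether to read it as dictionary
  // (the `dict_columns` case: few distinct values => dictionary-encode).
  std::function<bool(const std::string& column, int64_t distinct_estimate)>
      read_as_dictionary;

  // Decide the read block size while the scan is running, e.g. grow it if
  // the scan turns out to be I/O bound.
  std::function<int64_t(int64_t current_block_size)> next_block_size;
};
```

In this sketch a scanner would consult `read_as_dictionary` as each column is first decoded and `next_block_size` between reads, so decisions can react to what the scan has already observed rather than being fixed up front.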
