lidavidm commented on pull request #9725: URL: https://github.com/apache/arrow/pull/9725#issuecomment-803488875
To share the context for this refactor, Ben wrote up this doc: https://docs.google.com/document/d/1LzlDnnmKGCkD9RWGXyMQDHwf14Ad9K4ojn9AafkGFSg/edit?usp=sharing

> This is something we need to do; dask and other advanced parquet consumers need ridiculously sophisticated hooks for scanning (let alone writing). For example: whether to populate statistics (when reading into a single table with no filter, there is no point in converting statistics to expressions); whether they should be accumulated or cached (the cudf folks wanted to copy the unparsed metadata buffers to the GPU); conversion details (`dict_columns` might be decided interactively, when a string column is discovered to have few distinct values); I/O minutiae (stream block size/buffering/chunking/... might be adjusted after a scan starts taking too long); ...

Depending on what everyone thinks, I may revisit the implementation, but yes, let's try to present a convenient API for R/Python users, and have a nice feature to announce for 4.0.
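To make the "hooks" idea above more concrete, here is a minimal, purely hypothetical C++ sketch. None of these names exist in Arrow; it only illustrates how the late-bound decisions listed in the quote (statistics, raw-metadata access, per-column dictionary encoding, block size) could be expressed as callbacks:

```cpp
// Hypothetical sketch only -- not the actual Arrow API. Shows one way the
// per-scan hooks discussed above could look as plain std::function callbacks.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Placeholder for a file's unparsed metadata bytes (e.g. a Parquet footer).
struct MetadataBuffer {
  std::vector<uint8_t> bytes;
};

// Hypothetical per-scan options whose decisions are deferred until scan time.
struct ScanHooks {
  // Skip statistics->expression conversion when no filter is present.
  bool populate_statistics = true;

  // Called with each file's raw metadata buffer, so a consumer (e.g. cudf)
  // could copy it to the GPU instead of having it parsed on the CPU.
  std::function<void(const MetadataBuffer&)> on_raw_metadata;

  // Decide per column, as it is discovered, whether to read it as dictionary
  // (the `dict_columns` case: few distinct values => dictionary-encode).
  std::function<bool(const std::string& column, int64_t distinct_estimate)>
      read_as_dictionary;

  // Decide the read block size while the scan is running, e.g. grow it if
  // the scan turns out to be I/O bound.
  std::function<int64_t(int64_t current_block_size)> next_block_size;
};
```

In this sketch a scanner would consult `read_as_dictionary` as each column is first decoded and `next_block_size` between reads, so decisions can react to what the scan has already observed rather than being fixed up front.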
