[GitHub] [arrow] houqp opened a new pull request #7210: ARROW-8839: [Rust] [DataFusion] support CSV schema inference in logical plan

GitBox Sun, 17 May 2020 16:35:12 -0700


houqp opened a new pull request #7210:
URL: https://github.com/apache/arrow/pull/7210



   This PR changes the `schema` argument of the `scan_csv` method to `Option<&Schema>`, so that the schema can be inferred from the CSV data when none is supplied. 
Supporting this required several related changes:
   
   * added a delimiter argument to all CSV-related structs and functions
   * fixed a bug in the schema field inference function
   * made `arrow::csv::reader::infer_file_schema` public so it can be used by 
DataFusion
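
As a rough illustration of the `Option<&Schema>` pattern described above, the sketch below uses the caller's schema when one is given and only runs inference otherwise. The `Schema` struct and the `resolve_schema` helper are hypothetical stand-ins invented for this example; they are not the arrow or DataFusion types, and the real `scan_csv` signature may differ.

```rust
// Hypothetical stand-in for a schema; NOT arrow's `Schema` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Returns a copy of the caller-supplied schema when there is one;
/// otherwise calls `infer` to produce a schema from the data.
/// `infer` is only invoked in the `None` case, so callers that pass a
/// schema never pay for reading the file.
pub fn resolve_schema<F>(schema: Option<&Schema>, infer: F) -> Schema
where
    F: FnOnce() -> Schema,
{
    schema.cloned().unwrap_or_else(infer)
}
```

Taking the inference step as a closure keeps it lazy: `resolve_schema(Some(&s), ...)` returns a clone of `s` without touching the closure, while `resolve_schema(None, ...)` falls back to it.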
   
   Known limitations: 
   * when given a directory of CSV files, the schema inference code only 
reads rows from the first file
   * to avoid adding yet another argument to all CSV-related functions, I 
hard-coded the number of rows read for schema inference to 100
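
To make the effect of the row cap concrete, here is a minimal, std-only sketch of schema inference over the first N records. It is not the code in `arrow::csv::reader`: `ColumnType`, `infer_cell`, `merge`, and `infer_schema` are hypothetical names, the type rules are deliberately simplified, generated column names are an assumption, and quoted fields are not handled.

```rust
/// Simplified column types; hypothetical, not arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Guess the type of a single non-empty cell.
fn infer_cell(value: &str) -> ColumnType {
    if value.parse::<i64>().is_ok() {
        ColumnType::Int64
    } else if value.parse::<f64>().is_ok() {
        ColumnType::Float64
    } else if value.eq_ignore_ascii_case("true") || value.eq_ignore_ascii_case("false") {
        ColumnType::Boolean
    } else {
        ColumnType::Utf8
    }
}

/// Combine two observations for one column: equal types are kept,
/// Int64 and Float64 widen to Float64, anything else becomes Utf8.
fn merge(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}

/// Infer (name, type) pairs from at most `max_read_records` data rows.
pub fn infer_schema(
    data: &str,
    delimiter: u8,
    has_header: bool,
    max_read_records: usize,
) -> Vec<(String, ColumnType)> {
    let delimiter = delimiter as char;
    let mut lines = data.lines().filter(|l| !l.is_empty()).peekable();

    // Column names come from the header, or are generated from the
    // width of the first row when there is no header.
    let names: Vec<String> = if has_header {
        match lines.next() {
            Some(h) => h.split(delimiter).map(|s| s.trim().to_string()).collect(),
            None => return Vec::new(),
        }
    } else {
        match lines.peek() {
            Some(first) => (1..=first.split(delimiter).count())
                .map(|i| format!("column_{}", i))
                .collect(),
            None => return Vec::new(),
        }
    };

    // `None` means no non-empty value has been seen for that column yet.
    let mut types: Vec<Option<ColumnType>> = vec![None; names.len()];
    for line in lines.take(max_read_records) {
        for (slot, cell) in types.iter_mut().zip(line.split(delimiter)) {
            let cell = cell.trim();
            if cell.is_empty() {
                continue; // treat empty cells as nulls
            }
            let seen = infer_cell(cell);
            *slot = Some(match *slot {
                Some(prev) => merge(prev, seen),
                None => seen,
            });
        }
    }

    // Columns with no observed values fall back to Utf8.
    names
        .into_iter()
        .zip(types)
        .map(|(n, t)| (n, t.unwrap_or(ColumnType::Utf8)))
        .collect()
}
```

With a cap of 2, the input `"id|price\n1|2.5\n2|3\nx|4\n"` yields `Int64` for `id`, because the non-numeric third row is never read; with a larger cap the same column becomes `Utf8`. That is the trade-off behind a fixed sample size.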
   
   Open questions:
   * Should we rename the `datasource::csv::CsvFile` struct to `CsvTable` to keep 
it consistent with `ParquetTable` and `MemoryTable`? The implementation of `CsvFile` 
also supports reading from a directory of files, so `CsvFile` is not an 
accurate name.
   * The CSV-related function argument lists are getting long. Should we introduce 
a CSV options struct to capture the following settings with sensible defaults?
     - schema
     - has_header
     - delimiter
     - infer_max_read_records
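
One possible shape for such an options struct is sketched below. The struct name `CsvOptions`, the builder-style methods, and the placeholder `Schema` type are all hypothetical. The defaults for `has_header` (true) and `delimiter` (comma) are assumptions; only the default of 100 rows mirrors the value hard-coded in this PR.

```rust
// Hypothetical placeholder; a real version would use arrow's `Schema`.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical bundle of CSV settings with defaults.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CsvOptions<'a> {
    /// Explicit schema; `None` means "infer it from the data".
    pub schema: Option<&'a Schema>,
    /// Whether the first row holds column names.
    pub has_header: bool,
    /// Field delimiter as a single byte.
    pub delimiter: u8,
    /// Maximum number of rows to read during schema inference.
    pub infer_max_read_records: usize,
}

impl<'a> Default for CsvOptions<'a> {
    fn default() -> Self {
        CsvOptions {
            schema: None,
            has_header: true,            // assumed default
            delimiter: b',',             // assumed default
            infer_max_read_records: 100, // value hard-coded in this PR
        }
    }
}

impl<'a> CsvOptions<'a> {
    /// Start from the defaults.
    pub fn new() -> Self {
        Self::default()
    }

    /// Use an explicit schema instead of inferring one.
    pub fn schema(mut self, schema: &'a Schema) -> Self {
        self.schema = Some(schema);
        self
    }

    /// Set whether the first row is a header.
    pub fn has_header(mut self, has_header: bool) -> Self {
        self.has_header = has_header;
        self
    }

    /// Set the field delimiter.
    pub fn delimiter(mut self, delimiter: u8) -> Self {
        self.delimiter = delimiter;
        self
    }

    /// Set the inference row cap.
    pub fn infer_max_read_records(mut self, n: usize) -> Self {
        self.infer_max_read_records = n;
        self
    }
}
```

A caller would then override only what differs from the defaults, for example `CsvOptions::new().delimiter(b'|').has_header(false)`, and later settings could be added without changing every function signature.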
   


