houqp opened a new pull request #7210:
URL: https://github.com/apache/arrow/pull/7210
This PR changes the schema argument of the `scan_csv` method to `Option<&Schema>`.
Related changes needed to make this work include:
* added a delimiter argument to all CSV-related structs and functions
* fixed a bug in the schema field inference function
* made `arrow::csv::reader::infer_file_schema` public so it can be used by
DataFusion
Known limitations:
* When given a directory of CSV files, the schema inference code only
reads rows from the first file.
* To avoid adding yet another argument to all CSV-related functions, I
hard-coded the number of rows read for schema inference to 100.
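
To illustrate why the row cap matters, here is a minimal, std-only sketch of
row-capped type inference. It is not the code in this PR: `InferredType`,
`classify`, `merge` and `infer_types` are hypothetical names, the `u8`
delimiter type is an assumption, and quoted fields are ignored for brevity.

```rust
/// Hypothetical, simplified set of column types (not Arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InferredType {
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Classify a single non-empty field, trying the narrowest type first.
fn classify(value: &str) -> InferredType {
    if value.parse::<i64>().is_ok() {
        InferredType::Int64
    } else if value.parse::<f64>().is_ok() {
        InferredType::Float64
    } else if value.parse::<bool>().is_ok() {
        InferredType::Boolean
    } else {
        InferredType::Utf8
    }
}

/// Combine two observations for the same column into one type.
fn merge(a: InferredType, b: InferredType) -> InferredType {
    match (a, b) {
        (x, y) if x == y => x,
        // Integers and floats in one column widen to Float64.
        (InferredType::Int64, InferredType::Float64)
        | (InferredType::Float64, InferredType::Int64) => InferredType::Float64,
        // Any other mix falls back to a string column.
        _ => InferredType::Utf8,
    }
}

/// Infer one type per column from at most `max_records` data rows.
/// Assumption: fields never contain the delimiter (no quoting support).
pub fn infer_types(
    data: &str,
    delimiter: u8,
    has_header: bool,
    max_records: usize,
) -> Vec<InferredType> {
    let sep = delimiter as char;
    let mut lines = data.lines();
    if has_header {
        lines.next(); // skip the header row
    }
    let mut types: Vec<Option<InferredType>> = Vec::new();
    for line in lines.take(max_records) {
        for (i, value) in line.split(sep).enumerate() {
            if i >= types.len() {
                types.resize(i + 1, None);
            }
            if value.is_empty() {
                continue; // an empty field gives no evidence about the type
            }
            let seen = classify(value);
            types[i] = Some(match types[i] {
                Some(prev) => merge(prev, seen),
                None => seen,
            });
        }
    }
    // Columns with no observed values default to Utf8.
    types
        .into_iter()
        .map(|t| t.unwrap_or(InferredType::Utf8))
        .collect()
}
```

In this sketch, a column whose first non-numeric value appears after the cap
is still inferred as numeric, which is the trade-off a fixed limit of 100 rows
accepts.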
Open questions:
* Should we rename the `datasource::csv::CsvFile` struct to `CsvTable` to keep
it consistent with `ParquetTable` and `MemoryTable`? The implementation of
`CsvFile` also supports reading from a directory of files, so `CsvFile` is not
an accurate name.
* The argument lists of the CSV-related functions are getting long. Should we
introduce a CSV options struct to capture the following configs with sensible
defaults?
  - schema
  - has_header
  - delimiter
  - infer_max_read_records
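
To make the second question concrete, here is one possible shape for such an
options struct. It is only a sketch: `CsvOptions` is a hypothetical name, the
`Schema` type below is a stand-in for `arrow::datatypes::Schema` so the example
compiles on its own, and the `u8` delimiter type and the `has_header = true`
default are assumptions. Only the default of 100 rows comes from this PR.

```rust
/// Stand-in for `arrow::datatypes::Schema`, used only to keep this sketch
/// self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub field_names: Vec<String>,
}

/// Hypothetical options struct bundling the CSV-related arguments.
#[derive(Debug, Clone, PartialEq)]
pub struct CsvOptions<'a> {
    /// `None` means "infer the schema from the file".
    pub schema: Option<&'a Schema>,
    /// Whether the first row holds column names.
    pub has_header: bool,
    /// Field delimiter as a single byte (assumed type).
    pub delimiter: u8,
    /// Maximum number of rows to read during schema inference.
    pub infer_max_read_records: usize,
}

impl<'a> Default for CsvOptions<'a> {
    fn default() -> Self {
        CsvOptions {
            schema: None,
            has_header: true, // assumed default
            delimiter: b',',
            infer_max_read_records: 100, // the value hard-coded in this PR
        }
    }
}
```

With struct update syntax, callers would override only what they need, for
example `CsvOptions { delimiter: b'\t', ..Default::default() }` for
tab-separated files, and adding a new option later would not change any
function signature.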