westonpace commented on issue #34010: URL: https://github.com/apache/arrow/issues/34010#issuecomment-1414553376
It is not obvious, but it is possible. The `block_size` in [`pyarrow.csv.ReadOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions) is what actually determines our inference depth. To specify custom read options you will need to create a [`pyarrow.dataset.CsvFileFormat`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.CsvFileFormat.html#pyarrow.dataset.CsvFileFormat). Regrettably, the inference depth is always somewhat tied to our I/O performance. However, I suspect you can bump up the default quite a bit before you start to notice significant effects. A complete example:

```
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

MiB = 1024 * 1024
read_options = csv.ReadOptions(block_size=16 * MiB)  # Note, the default is 1MiB
csv_format = ds.CsvFileFormat(read_options=read_options)

my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
print(my_dataset.to_table())
```
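If only a handful of columns are being mis-inferred, another option is to pin their types explicitly through `ConvertOptions.column_types` instead of (or in addition to) enlarging the block. A minimal sketch, assuming a hypothetical column named `my_string_col` that inference keeps guessing wrong:

```
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

# Pin the type of a problematic column; all other columns are still inferred
# from the data. 'my_string_col' is a made-up name, substitute your own.
convert_options = csv.ConvertOptions(column_types={'my_string_col': pa.string()})
csv_format = ds.CsvFileFormat(convert_options=convert_options)

my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
print(my_dataset.to_table())
```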
