[
https://issues.apache.org/jira/browse/ARROW-16436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-16436:
-----------------------------------
Labels: good-first-issue pull-request-available (was: good-first-issue)
> [C++] Datasets ignores CSV autogenerate_column_names during discovery
> ---------------------------------------------------------------------
>
> Key: ARROW-16436
> URL: https://issues.apache.org/jira/browse/ARROW-16436
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 7.0.0
> Reporter: David Li
> Assignee: Raúl Cumplido
> Priority: Major
> Labels: good-first-issue, pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Reproduction:
> {code:python}
> import tempfile
> from pathlib import Path
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> print("PyArrow version:", pa.__version__)
> ro = csv.ReadOptions(autogenerate_column_names=True)
> po = csv.ParseOptions()
> co = csv.ConvertOptions()
> file_format = ds.CsvFileFormat(read_options=ro, parse_options=po,
>                                convert_options=co)
> with tempfile.TemporaryDirectory() as td:
>     td = Path(td).resolve()
>     with (td / "test.csv").open("w") as sink:
>         sink.write("1,a,true,1\n")
>     dataset = ds.dataset(str(td), format=file_format)
>     print(dataset.to_table())
> {code}
> Result:
> {noformat}
> PyArrow version: 7.0.0
> Traceback (most recent call last):
>   File "/home/lidavidm/csvdemo.py", line 20, in <module>
>     dataset = ds.dataset(str(td), format=file_format)
>   File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 667, in dataset
>     return _filesystem_dataset(source, **kwargs)
>   File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
>     return factory.finish(schema)
>   File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/tmp5rz0ipmm/test.csv': Could not open CSV input source '/tmp/tmp5rz0ipmm/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
> {noformat}
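The error text suggests that schema discovery treats the only row, `1,a,true,1`, as a header row, so the value `1` shows up twice as a column name. The stdlib-only sketch below illustrates that reading. It does not use or reproduce Arrow's code: `infer_column_names` and `duplicate_names` are hypothetical helpers written for this illustration, and the `f0`, `f1`, ... names assume the `f{n}` pattern that pyarrow documents for autogenerated column names.

```python
import csv
import io
from collections import Counter


def infer_column_names(text, autogenerate_column_names=False):
    """Hypothetical helper: pick column names from the first CSV row."""
    first_row = next(csv.reader(io.StringIO(text)))
    if autogenerate_column_names:
        # Assumed naming pattern: f0, f1, ... (one name per column).
        return [f"f{i}" for i in range(len(first_row))]
    # Without autogeneration the first row is taken as the header.
    return first_row


def duplicate_names(names):
    """Hypothetical helper: return the names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


data = "1,a,true,1\n"

# Option ignored: the data row becomes the header and "1" is duplicated.
header_names = infer_column_names(data)
print(header_names, duplicate_names(header_names))

# Option honored: generated names are unique, so no duplicate is reported.
generated = infer_column_names(data, autogenerate_column_names=True)
print(generated, duplicate_names(generated))
```

If this reading is right, the reproduction should stop failing once discovery passes the `ReadOptions` through, since the generated names cannot collide.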