[ https://issues.apache.org/jira/browse/ARROW-16436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-16436:
-----------------------------------
    Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [C++] Datasets ignores CSV autogenerate_column_names during discovery
> ---------------------------------------------------------------------
>
>                 Key: ARROW-16436
>                 URL: https://issues.apache.org/jira/browse/ARROW-16436
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 7.0.0
>            Reporter: David Li
>            Assignee: Raúl Cumplido
>            Priority: Major
>              Labels: good-first-issue, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reproduction
> {code:python}
> import tempfile
> from pathlib import Path
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as ds
> print("PyArrow version:", pa.__version__)
> ro = csv.ReadOptions(autogenerate_column_names=True)
> po = csv.ParseOptions()
> co = csv.ConvertOptions()
> file_format = ds.CsvFileFormat(read_options=ro, parse_options=po, convert_options=co)
> with tempfile.TemporaryDirectory() as td:
>     td = Path(td).resolve()
>     with (td / "test.csv").open("w") as sink:
>         sink.write("1,a,true,1\n")
>     dataset = ds.dataset(str(td), format=file_format)
>     print(dataset.to_table())
> {code}
> Result:
> {noformat}
> PyArrow version: 7.0.0
> Traceback (most recent call last):
>   File "/home/lidavidm/csvdemo.py", line 20, in <module>
>     dataset = ds.dataset(str(td), format=file_format)
>   File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 667, in dataset
>     return _filesystem_dataset(source, **kwargs)
>   File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
>     return factory.finish(schema)
>   File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/tmp5rz0ipmm/test.csv': Could not open CSV input source '/tmp/tmp5rz0ipmm/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
> {noformat}
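
To make the error message easier to interpret, here is a minimal sketch in plain Python. It is not Arrow's discovery code, and the helper `infer_column_names` is hypothetical. It assumes the failure comes from the first row `1,a,true,1` being treated as a header during discovery, which would make the name `1` appear twice. With `autogenerate_column_names=True` the names would instead be generated as `f0`, `f1`, and so on.

```python
import csv
import io


def infer_column_names(data, autogenerate):
    """Hypothetical helper: derive column names from the first CSV row."""
    first_row = next(csv.reader(io.StringIO(data)))
    if autogenerate:
        # PyArrow documents autogenerated names as "f0", "f1", ...
        return [f"f{i}" for i in range(len(first_row))]
    # Otherwise the first row is interpreted as the header row.
    duplicates = sorted({n for n in first_row if first_row.count(n) > 1})
    if duplicates:
        raise ValueError(
            f"CSV file contained multiple columns named {duplicates[0]}"
        )
    return first_row


data = "1,a,true,1\n"  # same single row as in the reproduction
print(infer_column_names(data, autogenerate=True))
try:
    infer_column_names(data, autogenerate=False)
except ValueError as exc:
    print("header interpretation fails:", exc)
```

Under that assumption, the reported message is what one would expect if the `autogenerate_column_names` setting is not consulted when the schema is first inspected.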



--
This message was sent by Atlassian Jira
(v8.20.7#820007)