[
https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-6231:
-----------------------------------
Component/s: C++
> [C++][Python] Consider assigning default column names when reading CSV file
> and header_rows=0
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-6231
> URL: https://issues.apache.org/jira/browse/ARROW-6231
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: csv, pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
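> Reading a CSV file with header_rows=0 currently fails unless column names are
> given explicitly, which is a slight usability rough edge. Assigning default
> names (like "f0", "f1", ...) would probably be better, since then at least you
> can see how many columns there are and what is in them.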
> {code}
> In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0)
>
> In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', parse_options=parse_options)
>
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> <timed exec> in <module>
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in pyarrow._csv.read_csv()
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: header_rows == 0 needs explicit column names
> {code}
> In pandas, integers are used as the default column labels, so for Arrow some
> kind of default string would have to be defined:
> {code}
> In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, low_memory=False)
>
> In [19]: df.columns
> Out[19]:
> Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
>             17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
>            dtype='int64')
> {code}
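The naming scheme proposed above can be sketched in a few lines. This is only
an illustration using the Python standard library, not Arrow's CSV reader; the
helper names default_column_names and read_headerless are hypothetical, the
"f" prefix is taken from the suggestion in the description, and every row is
assumed to have the same number of fields.

```python
import csv
import io


def default_column_names(num_columns, prefix="f"):
    """Return placeholder names such as f0, f1, ... for a headerless table."""
    return [f"{prefix}{i}" for i in range(num_columns)]


def read_headerless(stream, delimiter="|"):
    """Read delimited text without a header row into a dict of columns.

    Assumes every row has as many fields as the first row.
    """
    rows = list(csv.reader(stream, delimiter=delimiter))
    if not rows:
        return {}
    # The column count is inferred from the first data row.
    names = default_column_names(len(rows[0]))
    # Transpose the rows into columns keyed by the generated names.
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


sample = io.StringIO("1|a|2.5\n2|b|3.5\n")
table = read_headerless(sample)
print(list(table))  # ['f0', 'f1', 'f2']
```

Unlike the integer labels pandas assigns, the generated names are strings, so
they can be used directly wherever a column name is expected.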