[jira] [Updated] (ARROW-6231) [C++][Python] Consider assigning default column names when reading CSV file and header_rows=0

Neal Richardson (Jira) Mon, 02 Sep 2019 08:03:06 -0700


     [ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Neal Richardson updated ARROW-6231:
-----------------------------------
    Component/s: C++

> [C++][Python] Consider assigning default column names when reading CSV file 
> and header_rows=0
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6231
>                 URL: https://issues.apache.org/jira/browse/ARROW-6231
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: csv, pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
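> Reading a CSV file with header_rows=0 currently fails unless column names are
> given explicitly, which is a slight usability rough edge. Assigning default
> names (like "f0, f1, ...") would probably be better, since then at least you
> can see how many columns there are and what is in them.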
> {code}
> In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0)
>
> In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', parse_options=parse_options)
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <timed exec> in <module>
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in pyarrow._csv.read_csv()
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: header_rows == 0 needs explicit column names
> {code}
> pandas uses integers as the default column labels in this case, so for Arrow
> some kind of default string name would have to be defined:
> {code}
> In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, low_memory=False)
>
> In [19]: df.columns
> Out[19]:
> Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
>             17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
>            dtype='int64')
> {code}
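
To make the proposal in the description concrete, here is a minimal sketch of
the "f0, f1, ..." naming scheme applied to a headerless, pipe-delimited input.
It uses only Python's standard csv module and is not the Arrow reader; the
helper names default_column_names and read_headerless are hypothetical and
exist only for this illustration.

```python
import csv
import io


def default_column_names(num_columns, prefix="f"):
    """Return placeholder names f0, f1, ... (hypothetical helper)."""
    return [f"{prefix}{i}" for i in range(num_columns)]


def read_headerless(text, delimiter="|"):
    """Parse delimited text that has no header row into a dict of columns.

    Column names are generated from the width of the first row, so the
    caller does not have to supply them explicitly.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return {}
    names = default_column_names(len(rows[0]))
    columns = {name: [] for name in names}
    for row in rows:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(row)}")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


table = read_headerless("1|a|x\n2|b|y\n")
print(list(table))  # ['f0', 'f1', 'f2']
```

With names generated this way, the number of columns and their contents are
visible immediately, which is the usability gain the description asks for.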



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
