[
https://issues.apache.org/jira/browse/ARROW-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francois Saint-Jacques resolved ARROW-7628.
-------------------------------------------
Fix Version/s: 1.0.0
Resolution: Fixed
Issue resolved by pull request 6463
[https://github.com/apache/arrow/pull/6463]
> [Python] Better document some read_csv corner cases
> ---------------------------------------------------
>
> Key: ARROW-7628
> URL: https://issues.apache.org/jira/browse/ARROW-7628
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: Ubuntu bionic
> Reporter: Athanassios Hatzis
> Assignee: Joris Van den Bossche
> Priority: Minor
> Labels: csv, pull-request-available, pyarrow
> Fix For: 1.0.0
>
> Attachments: spc_catalog.tsv
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Hi, I have found two problematic cases, possibly bugs, in the pyarrow
> *read_csv* function. I wrote the code below and ran it against the
> attached file.
> The code compares pandas read_csv with pyarrow csv.read_csv and shows that
> the latter does not behave as expected with the following parameter settings:
> 1. Change parameter skip_rows = 10:
> {code:python}
> Traceback (most recent call last):
> File
> "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py",
> line 3326, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "<ipython-input-21-8c5c88b190c4>", line 4, in <module>
> read_options=csv.ReadOptions(skip_rows=skip_rows,
> autogenerate_column_names=False, use_threads=True, column_names=column_names)
> File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
> File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist
> in CSV file
> {code}
> 2. Change parameters skip_rows = 12, columns = None:
> In this case the error above does not occur and all columns are fetched.
> However, compare the two dataframes: the one from pyarrow via to_pandas()
> and the one returned by pandas read_csv(). You will notice that pyarrow has
> not parsed the null values ('\\N') in the last column, catname, as nulls.
> pandas read_csv, by contrast, parsed all the null values correctly.
> {code:python}
> Out[28]:
> 1082 991 16.5 200 2014-09-10 1 bar
> 0 1082 997 0.55 100.0 2014-09-10 1 bar
> 1 1082 998 7.95 200.0 2014-03-03 0 \N
> 2 1083 998 12.50 NaN NaT 0 bar
> 3 1083 999 1.00 NaN NaT 0 foo
> 4 1084 994 57.30 100.0 2014-12-20 1 \N
> 5 1084 995 22.20 NaN NaT 0 foo
> 6 1084 998 48.60 200.0 2014-12-20 1 foo
> {code}
> Python code to test the attached file for the two cases reported above:
> {code:python}
> from pyarrow import csv
> import pyarrow as pa
> import pandas as pd
>
> file_location = 'spc_catalog.tsv'
> sep = '\t'
> nulls = ['\\N']
> columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
> column_names = None
> column_types = None
> skip_rows = None
> nrecords = None
>
> csv.read_csv(
>     file_location,
>     parse_options=csv.ParseOptions(delimiter=sep),
>     convert_options=csv.ConvertOptions(include_columns=columns,
>                                        column_types=column_types,
>                                        null_values=nulls),
>     read_options=csv.ReadOptions(skip_rows=skip_rows,
>                                  autogenerate_column_names=False,
>                                  use_threads=True,
>                                  column_names=column_names)
> ).to_pandas()
>
> pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns,
>             nrows=nrecords, names=column_names, dtype=column_types)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)