Alexander M created ARROW-12025:
-----------------------------------
Summary: pyarrow read_csv works incorrectly with multilines if skiprows is present
Key: ARROW-12025
URL: https://issues.apache.org/jira/browse/ARROW-12025
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Reporter: Alexander M
Reproducer:
import os
from pyarrow.csv import read_csv, ReadOptions
import pyarrow

print("pyarrow.__version__:", pyarrow.__version__)

test_filename = "test.csv"
# The only data row has a quoted field that spans three physical lines.
test_data = """col1,col2,col3,col4
"This is a very long
string with several
newline characters",2,3,4
"""

try:
    with open(test_filename, "w") as f:
        f.write(test_data)

    ans_1 = read_csv(test_filename)  # works fine
    print("ans_1: \n", ans_1)

    # Raises ArrowInvalid (see the traceback under Output)
    ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
    print("ans_2: \n", ans_2)
finally:
    os.remove(test_filename)
Output:
pyarrow.__version__: 3.0.0
ans_1:
pyarrow.Table
col1: string
col2: int64
col3: int64
col4: int64
Traceback (most recent call last):
  File "pyarrow_bug.py", line 21, in <module>
    ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
  File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
Note: Python version 3.8.8, platform Ubuntu 20.04.
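One possible reading of the error, sketched below with the standard library only: the message "Expected 1 columns, got 4" is what you would see if skip_rows dropped physical lines without regard to quoting. That behaviour is an assumption made for illustration and is not confirmed by this report. The helpers skip_physical_lines and column_counts are hypothetical and are not part of pyarrow.

```python
import csv
import io

test_data = (
    'col1,col2,col3,col4\n'
    '"This is a very long\n'
    'string with several\n'
    'newline characters",2,3,4\n'
)


def skip_physical_lines(text, n):
    """Drop the first n physical lines, ignoring CSV quoting (hypothetical helper)."""
    return "".join(text.splitlines(keepends=True)[n:])


def column_counts(text):
    """Count comma-separated fields per line, treating every line as its own row."""
    return [len(line.split(",")) for line in text.splitlines()]


# Assumption: skip_rows=2 removes two physical lines, cutting the quoted field in half.
remaining = skip_physical_lines(test_data, 2)
counts = column_counts(remaining)
print("remaining text:", repr(remaining))
print("fields per line after skipping:", counts)

# For comparison, a quote-aware parser sees only two logical records in the file.
records = list(csv.reader(io.StringIO(test_data)))
print("logical records:", len(records))
print("fields in the data record:", len(records[1]))
```

Under that assumption, the first remaining line has one field (it would be taken as the header) and the next has four, which matches the reported message. A quote-aware count finds only two logical records, so the result depends on whether skip_rows=2 is meant to count lines or parsed rows.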