Alexander M created ARROW-12025:
-----------------------------------

             Summary: pyarrow read_csv works incorrectly with multilines if skiprows is present
                 Key: ARROW-12025
                 URL: https://issues.apache.org/jira/browse/ARROW-12025
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 3.0.0
            Reporter: Alexander M


Reproducer:
import os

from pyarrow.csv import read_csv, ReadOptions
import pyarrow
print("pyarrow.__version__:", pyarrow.__version__)

test_filename = "test.csv"
test_data = """col1,col2,col3,col4
"This is a very long
string with several
newline characters",2,3,4
"""

try:
    with open(test_filename, "w") as f:
        f.write(test_data)

    ans_1 = read_csv(test_filename) # works fine
    print("ans_1: \n", ans_1)
    ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2)) # raises ArrowInvalid
    print("ans_2: \n", ans_2)
finally:
    os.remove(test_filename)
 
Output:
pyarrow.__version__: 3.0.0
ans_1:
 pyarrow.Table
col1: string
col2: int64
col3: int64
col4: int64
Traceback (most recent call last):
  File "pyarrow_bug.py", line 21, in <module>
    ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
  File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
 
Note: python version: 3.8.8, platform: Ubuntu 20.04
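
Possible explanation (an assumption based on the error message, not verified against the pyarrow source): skip_rows appears to count physical lines rather than CSV records, so skipping 2 lines ends inside the quoted multi-line value. The sketch below only illustrates that effect using the standard library csv module. It does not use pyarrow, and the helper names skip_physical_lines and field_counts are made up for this illustration.

```python
import csv
import io

# Same content as the reproducer's test file.
test_data = (
    "col1,col2,col3,col4\n"
    '"This is a very long\n'
    "string with several\n"
    'newline characters",2,3,4\n'
)


def skip_physical_lines(text, n):
    """Drop the first n newline-terminated lines, ignoring CSV quoting."""
    return "".join(text.splitlines(keepends=True)[n:])


def field_counts(text):
    """Parse text as CSV and return the number of fields in each record."""
    return [len(row) for row in csv.reader(io.StringIO(text, newline=""))]


# Skipping 2 physical lines cuts the quoted value in half: the first
# remaining line has 1 field, the next one has 4.
after_line_skip = skip_physical_lines(test_data, 2)
line_based_counts = field_counts(after_line_skip)

# Parsing the whole file first keeps the quoted value in one record,
# so every record has 4 fields.
record_based_counts = field_counts(test_data)

print("after skipping 2 lines:", line_based_counts)
print("parsed as records:     ", record_based_counts)
```

If this is what happens, the first remaining line would be taken as a 1-column header and the following line would have 4 fields, which would match "Expected 1 columns, got 4".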



