[ https://issues.apache.org/jira/browse/ARROW-12025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332558#comment-17332558 ]
Alexander M commented on ARROW-12025:
-------------------------------------
Hi [~BBQing],
Thanks for the answer!
I have examined the PyArrow documentation more closely, and it looks like you
are right: the PyArrow read_csv skip_rows parameter doesn't guarantee that rows
are skipped correctly in that case. I was just confused by the meaning of
`skiprows` in pandas. So thanks again, and sorry for the disturbance.
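
To illustrate the distinction discussed above, here is a minimal sketch that
uses only the Python standard library `csv` module, not PyArrow itself. It
assumes, based on this thread, that skip_rows drops physical lines before any
CSV parsing happens; the helper names `skip_physical_lines` and
`skip_parsed_records` are hypothetical and exist only for this example.

```python
import csv
import io

# Same data as in the reproducer: one header line and one record whose
# first field spans three physical lines.
TEST_DATA = (
    "col1,col2,col3,col4\n"
    '"This is a very long\n'
    "string with several\n"
    'newline characters",2,3,4\n'
)


def skip_physical_lines(text, n):
    """Drop the first n newline-terminated lines, ignoring CSV quoting."""
    return "".join(text.splitlines(keepends=True)[n:])


def skip_parsed_records(text, n):
    """Parse the CSV first, then drop the first n logical records."""
    return list(csv.reader(io.StringIO(text)))[n:]


# Skipping 2 physical lines cuts the quoted field in the middle.
remainder = skip_physical_lines(TEST_DATA, 2)
print(repr(remainder))

# Skipping 1 parsed record removes only the header; the multiline field
# stays in one piece.
records = skip_parsed_records(TEST_DATA, 1)
print(records)
```

After two physical lines are dropped, the first remaining line contains a
single field and the next one contains four, which would be consistent with
the "Expected 1 columns, got 4" error in the report if the first remaining
line is then treated as the header.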
> pyarrow read_csv works incorrectly with multilines if skiprows is present
> -------------------------------------------------------------------------
>
> Key: ARROW-12025
> URL: https://issues.apache.org/jira/browse/ARROW-12025
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Alexander M
> Priority: Critical
>
> Reproducer:
> import os
> from pyarrow.csv import read_csv, ReadOptions
> import pyarrow
> print("pyarrow.__version__:", pyarrow.__version__)
> test_filename = "test.csv"
> test_data = """col1,col2,col3,col4
> "This is a very long
> string with several
> newline characters",2,3,4
> """
> try:
>     with open(test_filename, "w") as f:
>         f.write(test_data)
>     ans_1 = read_csv(test_filename)  # works fine
>     print("ans_1: \n", ans_1)
>     # raises ArrowInvalid, see the output below
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>     print("ans_2: \n", ans_2)
> finally:
>     os.remove(test_filename)
>
> Output:
> pyarrow.__version__: 3.0.0
> ans_1:
> pyarrow.Table
> col1: string
> col2: int64
> col3: int64
> col4: int64
> Traceback (most recent call last):
>   File "pyarrow_bug.py", line 21, in <module>
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>   File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
>
> Note: python version: 3.8.8, platform: Ubuntu 20.04