[
https://issues.apache.org/jira/browse/ARROW-12025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17329352#comment-17329352
]
Šimon Procházka commented on ARROW-12025:
-----------------------------------------
Hi [~alexander_m],
I have examined this case and I don't think it is a bug.
From the documentation, skip_rows means skipping the first two lines before
the header. So two lines are skipped, the header becomes a single column
named "string with several", and the next line has 4 columns with the values
newline characters",2,3,4
hence the parse error.
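To make that reasoning concrete, here is a rough sketch that uses Python's
standard csv module instead of pyarrow. It assumes that skip_rows=2 drops the
first two physical lines before any quote-aware parsing happens, which is
what the error message suggests. The variable names are only illustrative.

```python
import csv
import io

# Same data as in the reproducer from the issue description.
test_data = '''col1,col2,col3,col4
"This is a very long
string with several
newline characters",2,3,4
'''

# Assumption: skipping counts raw lines, ignoring the open quote on line 2.
remaining = test_data.splitlines(keepends=True)[2:]

# Parse what is left: the first remaining line is taken as the header.
rows = list(csv.reader(io.StringIO("".join(remaining))))
header, data = rows[0], rows[1:]

print("header:", header, "->", len(header), "column(s)")
print("first data row:", data[0], "->", len(data[0]) , "column(s)")
```

The header ends up with one column while the following row has four, which
matches the "Expected 1 columns, got 4" message in the traceback.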
Try to come up with a new example, otherwise I think this issue can be closed.
> pyarrow read_csv works incorrectly with multilines if skiprows is present
> -------------------------------------------------------------------------
>
> Key: ARROW-12025
> URL: https://issues.apache.org/jira/browse/ARROW-12025
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Alexander M
> Priority: Critical
>
> Reproducer:
> import os
> from pyarrow.csv import read_csv, ReadOptions
> import pyarrow
> print("pyarrow.__version__:", pyarrow.__version__)
> test_filename = "test.csv"
> test_data = """col1,col2,col3,col4
> "This is a very long
> string with several
> newline characters",2,3,4
> """
> try:
>     with open(test_filename, "w") as f:
>         f.write(test_data)
>     ans_1 = read_csv(test_filename)  # works fine
>     print("ans_1: \n", ans_1)
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>     print("ans_2: \n", ans_2)
> finally:
>     os.remove(test_filename)
>
> Output:
> pyarrow.__version__: 3.0.0
> ans_1:
> pyarrow.Table
> col1: string
> col2: int64
> col3: int64
> col4: int64
> Traceback (most recent call last):
>   File "pyarrow_bug.py", line 21, in <module>
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>   File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
>
> Note: python version: 3.8.8, platform: Ubuntu 20.04