[ 
https://issues.apache.org/jira/browse/ARROW-12025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17329352#comment-17329352
 ] 

Šimon Procházka commented on ARROW-12025:
-----------------------------------------

Hi [~alexander_m] ,

I have examined this case and I don't think it is a bug.

According to the documentation, skip_rows is the number of lines to skip
before the header. With skip_rows=2 the first two lines of the file are
skipped, so the header is read from the third line and has a single column
named "string with several".

The next line is then read as a data row with 4 columns
(newline characters",2,3,4), hence the parse error.
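
To make the reasoning concrete, here is a small pure-Python sketch of what the
reader sees after two lines are dropped. It does not call pyarrow and is not
how pyarrow is implemented. It assumes that skipping works on raw lines and
uses a naive split on commas to count fields.

```python
# Same data as in the reproducer from the issue.
test_data = '''col1,col2,col3,col4
"This is a very long
string with several
newline characters",2,3,4
'''

# Assumption: skip_rows=2 drops the first two raw lines, ignoring quoting.
remaining = test_data.splitlines()[2:]

# The first remaining line becomes the header, the next one is a data row.
header_line, data_line = remaining

# Naive field count, only meant to illustrate the column mismatch.
header = header_line.split(",")
fields = data_line.split(",")

print("header:", header, "->", len(header), "column(s)")
print("data:  ", fields, "->", len(fields), "column(s)")
```

The header has one field and the data row has four, which matches the
"Expected 1 columns, got 4" message in the traceback.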

 

If you can come up with a different example that shows incorrect behavior,
please post it; otherwise I think this issue can be closed.

 

> pyarrow read_csv works incorrectly with multilines if skiprows is present
> -------------------------------------------------------------------------
>
>                 Key: ARROW-12025
>                 URL: https://issues.apache.org/jira/browse/ARROW-12025
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Alexander M
>            Priority: Critical
>
> Reproducer:
> import os
> from pyarrow.csv import read_csv, ReadOptions
> import pyarrow
> print("pyarrow.__version__:", pyarrow.__version__)
> test_filename = "test.csv"
> test_data = """col1,col2,col3,col4
> "This is a very long
> string with several
> newline characters",2,3,4
> """
> try :
>     with open(test_filename, "w") as f:
>         f.write(test_data)
>     ans_1 = read_csv(test_filename) # works fine
>     print("ans_1: \n", ans_1)
>     ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>     print("ans_2: \n", ans_2)
> finally:
>     os.remove(test_filename)
>  
> Output:
> pyarrow.__version__: 3.0.0
> ans_1:
>  pyarrow.Table
> col1: string
> col2: int64
> col3: int64
> col4: int64
> Traceback (most recent call last):
>  File "pyarrow_bug.py", line 21, in <module>
>  ans_2 = read_csv(test_filename, read_options=ReadOptions(skip_rows=2))
>  File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
>  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 4
>  
> Note: python version: 3.8.8, platform: Ubuntu 20.04



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
