[ 
https://issues.apache.org/jira/browse/ARROW-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Luis Cano Rodríguez updated ARROW-18084:
---------------------------------------------
    Description: 
I'm trying to read a specific large CSV file 
({{the-reddit-climate-change-dataset-comments.csv}} from [this 
dataset|https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset]) 
in batches. This is my code:

{code:python}
import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"

print(f"Reading {filename}...")
# Memory-map the file so the incremental reader can stream it
mmap = pa.memory_map(filename)

# Read the CSV one record batch at a time
reader = open_csv(mmap)
while True:
    try:
        batch = reader.read_next_batch()
        print(len(batch))
    except StopIteration:
        break
{code}

But after a few batches, I get an exception:

{noformat}
Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
1233
1279
1293

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [1], in <cell line: 14>()
     13 while True:
     14     try:
---> 15         batch = reader.read_next_batch()
     16         print(len(batch))
     17     except StopIteration:

File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in 
pyarrow.lib.RecordBatchReader.read_next_batch()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in 
pyarrow.lib.check_status()

ArrowInvalid: CSV parser got out of sync with chunker
{noformat}

I have tried changing the block size, but I always end up with that error 
sooner or later (see the sketch after this list):

- With {{read_options=ReadOptions(block_size=10_000)}}, it reads 1 batch of 11 
rows and then crashes
- With {{block_size=100_000}}, 103 rows and then crashes
- With {{block_size=1_000_000}}, 1164 rows and then crashes
- With {{block_size=10_000_000}}, 12370 rows and then crashes
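
For reference, this is roughly how each block size was tested (a minimal 
sketch; only the {{block_size}} value changes between runs, and the running 
row count is how the figures above were obtained):

{code:python}
import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"

# Same loop as above, but with an explicit block size and a running row count
mmap = pa.memory_map(filename)
reader = open_csv(mmap, read_options=ReadOptions(block_size=10_000))

total_rows = 0
try:
    while True:
        try:
            batch = reader.read_next_batch()
            total_rows += len(batch)
        except StopIteration:
            break
except pa.ArrowInvalid as exc:
    # e.g. "CSV parser got out of sync with chunker"
    print(f"Crashed after {total_rows} rows: {exc}")
{code}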

I am not sure what else to try here. According to [the C++ source 
code|https://github.com/apache/arrow/blob/cd33544533ee7d70cd8ff7556e59ef8f1d33a176/cpp/src/arrow/csv/reader.cc#L266-L267],
 this "should not happen".

I have tried pyarrow 7.0 and 9.0; the result and traceback are identical.



> "CSV parser got out of sync with chunker" on subsequent batches regardless of 
> block size
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-18084
>                 URL: https://issues.apache.org/jira/browse/ARROW-18084
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 7.0.0, 9.0.0
>         Environment: Ubuntu Linux
> pyarrow 9.0.0 installed with pip (manylinux wheel)
> Python 3.9.0 from conda-forge
> GCC 9.4.0
>            Reporter: Juan Luis Cano Rodríguez
>            Priority: Major
>         Attachments: Screenshot 2022-10-18 at 10-11-29 JupyterLab · 
> Orchest.png


