Juan Luis Cano Rodríguez created ARROW-18084:
------------------------------------------------
Summary: "CSV parser got out of sync with chunker" on subsequent
batches regardless of block size
Key: ARROW-18084
URL: https://issues.apache.org/jira/browse/ARROW-18084
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 9.0.0, 7.0.0
Environment: Ubuntu Linux
pyarrow 9.0.0 installed with pip (manylinux wheel)
Python 3.9.0 from conda-forge
GCC 9.4.0
Reporter: Juan Luis Cano Rodríguez
Attachments: Screenshot 2022-10-18 at 10-11-29 JupyterLab · Orchest.png
I'm trying to read a specific large CSV file
(`the-reddit-climate-change-dataset-comments.csv` from [this
dataset|https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset])
in batches. This is my code:
{code:python}
import os

import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions
import pyarrow.parquet as pq

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"

print(f"Reading {filename}...")
mmap = pa.memory_map(filename)
reader = open_csv(mmap)
while True:
    try:
        batch = reader.read_next_batch()
        print(len(batch))
    except StopIteration:
        break
{code}
But, after a few batches, I get an exception:
{noformat}
Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
1233
1279
1293
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [1], in <cell line: 14>()
     13 while True:
     14     try:
---> 15         batch = reader.read_next_batch()
     16         print(len(batch))
     17     except StopIteration:

File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in pyarrow.lib.RecordBatchReader.read_next_batch()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: CSV parser got out of sync with chunker
{noformat}
I have tried changing the block size, but I always end up with the same error
sooner or later:
- With {{read_options=ReadOptions(block_size=10_000)}}, it reads 1 batch of 11
rows and then crashes
- With {{block_size=100_000}}, 103 rows and then crashes
- With {{block_size=1_000_000}}, 1164 rows and then crashes
- With {{block_size=10_000_000}}, 12370 rows and then crashes
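For reference, the same batched-read pattern with an explicit {{block_size}} works fine on a small synthetic CSV held in memory (a stand-in I made up, not the Reddit file), which suggests the failure is triggered by something in the real data rather than by the loop itself:

{code:python}
import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions

# Synthetic stand-in for the large CSV: 1000 short rows.
data = b"id,body\n" + b"".join(
    f"{i},comment text {i}\n".encode() for i in range(1000)
)

# Small block_size to force the reader to produce many batches.
reader = open_csv(
    pa.BufferReader(data),
    read_options=ReadOptions(block_size=1_000),
)

total = 0
while True:
    try:
        batch = reader.read_next_batch()
        total += len(batch)
    except StopIteration:
        break

print(total)  # prints 1000
{code}

This runs to completion without the chunker error, reading all 1000 rows across multiple batches.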
I am not sure what else to try here. According to [the C++ source
code|https://github.com/apache/arrow/blob/cd33544533ee7d70cd8ff7556e59ef8f1d33a176/cpp/src/arrow/csv/reader.cc#L266-L267],
this "should not happen".
I have tried with pyarrow 7.0 and 9.0, with identical results and tracebacks.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)