Novice created ARROW-8677:
-----------------------------
Summary: Rust Parquet write_batch fails with batch size 10000 or 1 but is okay with 1000
Key: ARROW-8677
URL: https://issues.apache.org/jira/browse/ARROW-8677
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Affects Versions: 0.17.0
Environment: Linux debian
Reporter: Novice
I am using Rust to write a Parquet file and then read it from Python. However, when writing with write_batch and a batch size of 10000, reading the Parquet file from Python gives the error below:
```
>>> pd.read_parquet("some.parquet", engine="pyarrow")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
    path, columns=columns, **kwargs
  File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
    table = reader.read(**options)
  File "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
OSError: Unexpected end of stream
```
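For context, the write path looks roughly like the minimal sketch below. This is not my actual code: a single INT64 column stands in for my real schema (which has more columns), and the imports and Rc pointers follow the parquet 0.17 examples, so details may differ for other crate versions.

```rust
use std::{fs, path::Path, rc::Rc};

use parquet::{
    column::writer::ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn main() {
    let path = Path::new("some.parquet");

    // Hypothetical single-column schema; the real file has several columns.
    let message_type = "
        message schema {
            REQUIRED INT64 value;
        }
    ";
    let schema = Rc::new(parse_message_type(message_type).unwrap());
    // Rc matches the 0.17-era API; newer parquet releases use Arc here.
    let props = Rc::new(WriterProperties::builder().build());
    let file = fs::File::create(&path).unwrap();
    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

    // 450047 rows, pushed through write_batch in chunks of `batch_size` values.
    let values: Vec<i64> = (0..450_047).collect();
    let batch_size = 10_000; // 10000 and 1 produce unreadable files; 1000 is fine

    let mut row_group_writer = writer.next_row_group().unwrap();
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        if let ColumnWriter::Int64ColumnWriter(ref mut typed) = col_writer {
            for chunk in values.chunks(batch_size) {
                typed.write_batch(chunk, None, None).unwrap();
            }
        }
        row_group_writer.close_column(col_writer).unwrap();
    }
    writer.close_row_group(row_group_writer).unwrap();
    writer.close().unwrap();
}
```

The only thing I change between runs is the batch size.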
Also, when using batch size 1 and then reading from Python, there is an error too:
```
>>> pd.read_parquet("some.parquet", engine="pyarrow")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
    path, columns=columns, **kwargs
  File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1537, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1262, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 707, in read
    table = reader.read(**options)
  File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 337, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
OSError: The file only has 0 columns, requested metadata for column: 6
```
Using batch size 1000 is fine.
Note that my data has 450047 rows.
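In case the footer contents are useful for triage, here is a minimal sketch (again not from my original code) of how the written file's metadata can be inspected from the Rust side with the same parquet crate's reader API:

```rust
use std::{fs::File, path::Path};

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Open the file the writer produced and print what the footer claims.
    let file = File::open(&Path::new("some.parquet")).unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    println!("row groups: {}", metadata.num_row_groups());
    println!("columns:    {}", metadata.file_metadata().schema_descr().num_columns());
    println!("rows:       {}", metadata.file_metadata().num_rows());
}
```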