[
https://issues.apache.org/jira/browse/ARROW-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099953#comment-17099953
]
Novice commented on ARROW-8677:
-------------------------------
I added the file to reproduce the issue as an attachment. Please untar it
first: `tar -xvf test.parquet.tgz`.
```
import pandas as pd

df = pd.read_parquet("test.parquet", engine="fastparquet")  # works
df = pd.read_parquet("test.parquet", engine="pyarrow")      # fails
```
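For what it's worth, here is a small debugging sketch (mine, not part of the original report; it only assumes the stock `pyarrow.parquet` API) that prints the footer metadata and then reads the row groups one at a time, which should show whether pyarrow chokes on the footer itself or on a particular row group:
```
import pyarrow.parquet as pq

pf = pq.ParquetFile("test.parquet")
md = pf.metadata
print(md.num_rows, "rows,", md.num_row_groups, "row groups,", md.num_columns, "columns")

# Read one row group at a time to isolate the first one that fails.
for i in range(md.num_row_groups):
    try:
        pf.read_row_group(i)
    except Exception as exc:
        print("row group", i, "failed:", exc)
        break
```
Note that the file contains on the order of 40,000 row groups of 1000 rows each, so the loop is slow, but it avoids the single `read_all` call that the tracebacks below go through.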
The Rust code to generate the file is:
```
use parquet::{
    column::writer::ColumnWriter::Int32ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};
use std::{fs, rc::Rc};

fn main() {
    // Single REQUIRED INT32 column, matching the attached test.parquet.
    let schema = "
        message schema {
            REQUIRED INT32 a;
        }
    ";
    let schema = Rc::new(parse_message_type(schema).unwrap());
    let props = Rc::new(
        WriterProperties::builder()
            .set_statistics_enabled(false)
            .set_dictionary_enabled(false)
            .build(),
    );
    let file = fs::File::create("test.parquet").unwrap();
    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

    let batch_size = 1000;
    let mut data = vec![];
    for i in 0..batch_size {
        data.push(i);
    }

    // Write one row group of `batch_size` rows per iteration until more than
    // 40 million rows have been written.
    let mut j = 0;
    loop {
        let mut row_group_writer = writer.next_row_group().unwrap();
        let mut col_writer = row_group_writer.next_column().unwrap().unwrap();
        if let Int32ColumnWriter(ref mut typed_writer) = col_writer {
            typed_writer.write_batch(&data, None, None).unwrap();
        } else {
            panic!();
        }
        row_group_writer.close_column(col_writer).unwrap();
        writer.close_row_group(row_group_writer).unwrap();
        j += 1;
        if j * batch_size > 40_000_000 {
            break;
        }
    }
    writer.close().unwrap();
}
```
To compile it you need the workaround described in
https://issues.apache.org/jira/browse/ARROW-8536: create a directory `/format`
in the root of your file system and place the Flight.proto file there.
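For completeness, the workaround amounts to something like this (a hedged sketch; the source path for Flight.proto is an assumption and depends on where your Arrow checkout lives, and creating `/format` needs write access to the filesystem root):
```
# Sketch of the ARROW-8536 workaround; the source path below is an assumption.
import os
import shutil

os.makedirs("/format", exist_ok=True)
shutil.copy("arrow/format/Flight.proto", "/format/Flight.proto")
```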
> [Rust][Python][Parquet] Parquet write_batch and read from Python fails with
> batch size 10000 or 1 but okay with 1000
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-8677
> URL: https://issues.apache.org/jira/browse/ARROW-8677
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python, Rust
> Affects Versions: 0.17.0
> Environment: Linux debian
> Reporter: Novice
> Priority: Critical
> Attachments: test.parquet.tgz
>
>
> I am using Rust to write a Parquet file and read it from Python.
> When writing with write_batch and a batch size of 10000, reading the Parquet
> file from Python gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 125, in read
> path, columns=columns, **kwargs
> File
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1537, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1262, in read
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 707, in read
> table = reader.read(**options)
> File
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 337, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1130, in
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: Unexpected end of stream
> ```
> Also, when using a batch size of 1 and then reading from Python, there is an error too:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 125, in read
> path, columns=columns, **kwargs
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1537, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1262, in read
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 707, in read
> table = reader.read(**options)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 337, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1130, in
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using a batch size of 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema {
>     REQUIRED INT32 a;
>     REQUIRED INT32 b;
>     REQUIRED INT32 c;
>     REQUIRED INT64 d;
>     REQUIRED INT32 e;
>     REQUIRED BYTE_ARRAY f (UTF8);
>     REQUIRED BOOLEAN g;
> }
> ```
>
> EDIT: as I add more rows (estimated 80 million), using a batch size of 1000 does
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 125, in read
> path, columns=columns, **kwargs
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1537, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1262, in read
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 707, in read
> table = reader.read(**options)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 337, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1130, in
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Unless I am using it wrong (which doesn't seem to be the case, since the API is
> simple), this is not usable at all :(
>
> EDIT: some more logs, using a batch size of 1000 and a lot of rows:
> ```
> >>> df = pd.read_parquet("ping_pong.parquet", engine="pyarrow")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 125, in read
> path, columns=columns, **kwargs
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1537, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1262, in read
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 707, in read
> table = reader.read(**options)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 337, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1130, in
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: The file only has -959432807 columns, requested metadata for
> column: 6
> ```
>
> EDIT:
> I wanted to try fastparquet, but it seems fastparquet does not support
> .set_dictionary_enabled(true), so I set it to false.
> It turns out fastparquet reads the file fine, so this is likely a problem with pyarrow.
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line
> 125, in read
> path, columns=columns, **kwargs
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1281, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 1137, in read
> use_pandas_metadata=use_pandas_metadata)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 605, in read
> table = reader.read(**options)
> File
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py",
> line 253, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1136, in
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: The file only has -580697109 columns, requested metadata for
> column: 5
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="fastparquet")
> ```