yurikoomiga created ARROW-16642:
-----------------------------------
Summary: An Error Occurred While Reading Parquet File Using C++ -
GetRecordBatchReader - Corrupt snappy compressed data.
Key: ARROW-16642
URL: https://issues.apache.org/jira/browse/ARROW-16642
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 8.0.0
Environment: C++, arrow 7.0.0, snappy 1.1.8, arrow 8.0.0,
pyarrow 7.0.0, ubuntu 9.4.0, python3.8
Reporter: yurikoomiga
Attachments: test_std_02.py
Hi all,
When I read a Parquet file with Arrow as follows:
```
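// Open the Parquet file and wrap it in an Arrow FileReader.
auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties),
    &_reader);
// Request a batch reader for one row group and the selected columns.
arrow::Status status = _reader->GetRecordBatchReader(
    {_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);  // the error goes away when this line is removed
// Read the first batch; this is the call that fails.
status = _rb_batch->ReadNext(&_batch);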
```
the status is not OK and the following error is reported:
`IOError: Corrupt snappy compressed data.`
When I comment out the statement `_reader->set_use_threads(true);`, the
program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with
`_reader->set_use_threads(true);` enabled; reading a single column does not
trigger it.
The test Parquet file was created with pyarrow. It has a single row group
containing 3000000 records.
The file has 20 columns, including int and string types.
You can create a test Parquet file with the attached Python script.
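The attached test_std_02.py is not reproduced in this message. As a rough sketch only, a comparable file could be generated along the following lines. The helper names (`column_specs`, `make_column`, `write_test_file`), the 10 int / 10 string split, and the value ranges are illustrative assumptions and are not taken from the attachment; only the column count, the row count, the single row group, and the use of pyarrow come from the description above.

```python
import random
import string


def column_specs(num_int_cols=10, num_str_cols=10):
    """Return (name, kind) pairs; the 10/10 split is an assumption."""
    ints = [(f"int_{i}", "int") for i in range(num_int_cols)]
    strs = [(f"str_{i}", "str") for i in range(num_str_cols)]
    return ints + strs


def make_column(kind, num_rows, rng):
    """Build one column as a Python list of random ints or 8-letter strings."""
    if kind == "int":
        return [rng.randrange(1_000_000) for _ in range(num_rows)]
    return ["".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(num_rows)]


def write_test_file(path, num_rows=3_000_000, seed=0):
    """Write a Snappy-compressed Parquet file with a single row group."""
    # pyarrow is imported here so the helpers above need only the standard library.
    import pyarrow as pa
    import pyarrow.parquet as pq

    rng = random.Random(seed)
    names, arrays = [], []
    for name, kind in column_specs():
        names.append(name)
        # Convert one column at a time to limit peak memory from Python objects.
        arrays.append(pa.array(make_column(kind, num_rows, rng)))
    table = pa.Table.from_arrays(arrays, names=names)
    # row_group_size equal to num_rows keeps all records in one row group.
    pq.write_table(table, path, compression="snappy", row_group_size=num_rows)
```

Calling `write_test_file("test.parquet")` (a hypothetical file name) should produce a 20-column file with one row group, which can then be passed to the C++ reader shown earlier. Whether such a file triggers the same error is not confirmed here.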
Reading the file: C++, arrow 7.0.0, snappy 1.1.8
Writing the file: python3.8, pyarrow 7.0.0
Looking forward to your reply
Thank you!
--
This message was sent by Atlassian Jira
(v8.20.7#820007)