Robert Gruener created ARROW-2842:
-------------------------------------
Summary: [Python] Cannot read parquet files with row group size of 1 From HDFS
Key: ARROW-2842
URL: https://issues.apache.org/jira/browse/ARROW-2842
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Robert Gruener
Attachments: single-row.parquet
This might be a bug in parquet-cpp; I need to spend a bit more time tracking
this down. Basically, given a file with a single row on HDFS, reading it with
pyarrow yields this error:
```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from
"10.103.182.28:50010": End of the stream
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
@ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
@ parquet::SerializedFile::ParseMetaData()
@
parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource,
std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties
const&, std::shared_ptr<parquet::FileMetaData> const&)
@
parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource,
std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties
const&, std::shared_ptr<parquet::FileMetaData> const&)
@ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile>
const&, arrow::MemoryPool*, parquet::ReaderProperties const&,
std::shared_ptr<parquet::FileMetaData> const&,
std::unique_ptr<parquet::arrow::FileReader,
std::default_delete<parquet::arrow::FileReader> >*)
@ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*,
_object*)
```
The following code causes it:
```
import pyarrow
import pyarrow.parquet as pq
fs = pyarrow.hdfs.connect() # fill in namenode information
file_object = fs.open('single-row.parquet') # update for hdfs path of file
pq.read_metadata(file_object) # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0) # throws error
```
I am working on writing a unit test for this.