[ https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Gruener updated ARROW-2842:
----------------------------------
Description:
This might be a bug in parquet-cpp; I need to spend a bit more time tracking it down. Basically, given a file with a single row on HDFS, reading it with pyarrow yields this error:

```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
	@	Unknown
	@	Unknown
	@	Unknown
	@	Unknown
	@	Unknown
	@	Unknown
	@	Unknown
	@	Unknown
	@	Unknown
	@	arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
	@	parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
	@	parquet::SerializedFile::ParseMetaData()
	@	parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
	@	parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
	@	parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
	@	__pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
```

The following code reproduces it:

```
import pyarrow
import pyarrow.parquet as pq

# Fill in namenode information for your cluster.
fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')

# Update for the HDFS path of the file.
file_object = fs.open('single-row.parquet')

pq.read_metadata(file_object)  # this works

parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws the error above
```

I am working on writing a unit test for this. Note that I am using libhdfs3.
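For context on the failing read: the stack shows parquet::SerializedFile::ParseMetaData() failing while reading 8 bytes from the end of the stream, which matches the size of the Parquet file trailer (a 4-byte little-endian footer length followed by the magic bytes `PAR1`). The pure-Python sketch below is illustrative only, not Arrow's actual C++ implementation; the `read_footer_trailer` helper and the fake file image are invented for demonstration:

```
# Illustrative sketch of the first read ParseMetaData() performs: a Parquet
# file ends with [4-byte little-endian footer length][b"PAR1"], so the reader
# must fetch exactly the last 8 bytes of the file. An end-of-stream error here
# suggests the seek/read offset computed for that trailer is off.
import struct

PARQUET_MAGIC = b"PAR1"

def read_footer_trailer(buf: bytes) -> int:
    """Parse the 8-byte trailer of an in-memory Parquet file image and
    return the footer (metadata) length it declares."""
    if len(buf) < 8:
        raise EOFError("file too small to contain a Parquet trailer")
    trailer = buf[-8:]
    footer_len = struct.unpack("<I", trailer[:4])[0]
    magic = trailer[4:]
    if magic != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: bad magic %r" % magic)
    return footer_len

# Fake file image: header magic, 32 bytes of "data", a 10-byte "footer",
# then the trailer declaring that footer's length.
fake = (PARQUET_MAGIC + b"\x00" * 32 + b"F" * 10
        + struct.pack("<I", 10) + PARQUET_MAGIC)
print(read_footer_trailer(fake))  # 10
```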
> [Python] Cannot read parquet files with row group size of 1 From HDFS
> ---------------------------------------------------------------------
>
>                 Key: ARROW-2842
>                 URL: https://issues.apache.org/jira/browse/ARROW-2842
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Robert Gruener
>            Priority: Major
>         Attachments: single-row.parquet

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)