[ 
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2842:
----------------------------------
    Description: 
This might be a bug in parquet-cpp; I need to spend more time tracking it down. 
Given a file with a single row on HDFS, reading it with pyarrow yields this 
error:

```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.
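
Independently of HDFS, one way to rule out a malformed attachment is to inspect the trailer of a local copy. The following is only a minimal sketch, assuming an unencrypted Parquet file whose last 8 bytes are a 4-byte little-endian metadata length followed by the `PAR1` magic. `read_footer_length` is a hypothetical helper, not part of pyarrow, and nothing here establishes that the failing 8-byte read in the trace corresponds to this trailer.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def read_footer_length(data: bytes) -> int:
    """Return the metadata length stored in the 8-byte trailer of a Parquet file.

    Hypothetical helper for local sanity checks; raises ValueError if the
    bytes do not look like a well-formed, unencrypted Parquet file.
    """
    # Smallest possible layout: 4-byte magic + empty metadata + 4-byte length + 4-byte magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the metadata length (little-endian uint32).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

Calling `read_footer_length(open('single-row.parquet', 'rb').read())` on a downloaded copy of the attachment should return a positive length if the file itself is intact, which would point back at the HDFS read path rather than the file contents.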

  was:
This might be a bug in parquet-cpp; I need to spend more time tracking it down. 
Given a file with a single row on HDFS, reading it with pyarrow yields this 
error:

```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.


> [Python] Cannot read parquet files with row group size of 1 From HDFS
> ---------------------------------------------------------------------
>
>                 Key: ARROW-2842
>                 URL: https://issues.apache.org/jira/browse/ARROW-2842
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Robert Gruener
>            Priority: Major
>         Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp; I need to spend more time tracking it 
> down. Given a file with a single row on HDFS, reading it with pyarrow yields 
> this error:
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>  @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>  @ parquet::SerializedFile::ParseMetaData()
>  @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
>  @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
>  @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
>  @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>  
> fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in namenode information
> file_object = fs.open('single-row.parquet') # update for hdfs path of file
> pq.read_metadata(file_object) # this works
> parquet_file = pq.ParquetFile(file_object)
> parquet_file.read_row_group(0) # throws error
> ```
>  
> I am working on writing a unit test for this. Note that I am using libhdfs3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
