Brecht Machiels created ARROW-1429:
--------------------------------------

             Summary: Error loading parquet file with _metadata from HDFS (pyarrow.lib.ArrowIOError: Failed to open local file)
                 Key: ARROW-1429
                 URL: https://issues.apache.org/jira/browse/ARROW-1429
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.6.0
         Environment: RHEL 6.8, Python 3.5.4 (Anaconda), Hadoop 2.6.0-cdh5.8.3
            Reporter: Brecht Machiels


I can open tables stored on HDFS as long as there is no _metadata file alongside 
the Parquet files.

For two tables with a _metadata file I get the following traceback:

{code}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 199, in read_table
    pq_table = read_hdfs_parquet(hdfs_path, columns)
  File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 251, in read_hdfs_parquet
    return HDFS_CONNECTION.read_parquet(hdfs_path, columns)
  File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/filesystem.py", line 168, in read_parquet
    filesystem=self)
  File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 535, in __init__
    self.common_metadata = ParquetFile(self.metadata_path).metadata
  File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 54, in __init__
    self.reader.open(source, metadata=metadata)
  File "_parquet.pyx", line 398, in pyarrow._parquet.ParquetReader.open
  File "io.pxi", line 705, in pyarrow.lib.get_reader
  File "io.pxi", line 472, in pyarrow.lib.memory_map
  File "io.pxi", line 451, in pyarrow.lib.MemoryMappedFile._open
  File "error.pxi", line 72, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Failed to open local file: hdfs://nameservice1/path/to/table/_metadata
{code}
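
The last frames of the traceback above suggest that the _metadata path reaches pyarrow.lib.memory_map as a plain string, so the hdfs:// URI is treated as a local file even though filesystem=self was passed to read_parquet. The following is a minimal sketch of that distinction, not pyarrow's actual code: FakeRemoteFS and open_metadata are hypothetical names introduced only for illustration.

```python
import io


class FakeRemoteFS:
    """Hypothetical stand-in for a remote filesystem such as HDFS."""

    def __init__(self, files):
        # Map of remote path -> file contents (bytes).
        self._files = dict(files)

    def open(self, path, mode="rb"):
        # Return a file-like object for a path known to this filesystem.
        if path not in self._files:
            raise IOError("No such remote file: " + path)
        return io.BytesIO(self._files[path])


def open_metadata(path, filesystem=None):
    """Open `path` through `filesystem` when one is given, else locally."""
    if filesystem is not None:
        return filesystem.open(path, "rb")
    # Without a filesystem the path is treated as a local file, which is
    # what the traceback suggests happened to the hdfs:// URI.
    return open(path, "rb")


remote_path = "hdfs://nameservice1/path/to/table/_metadata"
fs = FakeRemoteFS({remote_path: b"PAR1"})

# Routed through the filesystem object: the open succeeds.
with open_metadata(remote_path, filesystem=fs) as f:
    remote_bytes = f.read()

# Treated as a local path: the open fails, analogous to the reported error.
try:
    open_metadata(remote_path)
    local_open_failed = False
except OSError:
    local_open_failed = True
```

In this sketch the same path string succeeds or fails depending only on whether the filesystem object is consulted, which matches the "Failed to open local file" message for an hdfs:// path.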

For another table with a _metadata file, I get a different traceback:

{code}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 199, in read_table
    pq_table = read_hdfs_parquet(hdfs_path, columns)
  File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 251, in read_hdfs_parquet
    return HDFS_CONNECTION.read_parquet(hdfs_path, columns)
  File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/filesystem.py", line 168, in read_parquet
    filesystem=self)
  File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 548, in __init__
    self.validate_schemas()
  File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 557, in validate_schemas
    self.schema = self.pieces[0].get_metadata(open_file).schema
IndexError: list index out of range
{code}
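
The IndexError above is consistent with self.pieces being empty when validate_schemas indexes it; the traceback does not show why no pieces were found. As an illustration only, the sketch below separates data files from sidecar files such as _metadata and fails with a descriptive error when no data files remain. The helpers split_dataset_listing and first_piece and the file name part-0.parquet are hypothetical and are not part of pyarrow.

```python
def split_dataset_listing(paths):
    """Separate Parquet data files from sidecar metadata files.

    Hypothetical helper: files whose base name starts with "_"
    (such as _metadata) are treated as sidecars, not data pieces.
    """
    pieces, sidecars = [], []
    for path in paths:
        # Take the last path component as the file name.
        name = path.rstrip("/").rsplit("/", 1)[-1]
        (sidecars if name.startswith("_") else pieces).append(path)
    return pieces, sidecars


def first_piece(pieces):
    """Return the first data piece, failing with a descriptive error."""
    if not pieces:
        raise ValueError("no Parquet data files found in dataset listing")
    return pieces[0]


# Assumed example listing; the data file name is made up.
listing = [
    "/path/to/table/_metadata",
    "/path/to/table/part-0.parquet",
]
pieces, sidecars = split_dataset_listing(listing)
```

Checking for an empty list before indexing would turn the bare IndexError into a message that names the actual problem.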



