Hossein Falaki created ARROW-4723:
-------------------------------------

             Summary: Skip _files when reading a directory containing parquet files
                 Key: ARROW-4723
                 URL: https://issues.apache.org/jira/browse/ARROW-4723
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Hossein Falaki


It is common for Apache Spark and other big data platforms to save additional metadata files, whose names begin with an underscore (_), when writing parquet data.

When using {{make_batch_reader}} to load a parquet directory containing such files, we encounter the following error:
{code:java}
PetastormMetadataError                    Traceback (most recent call last)
/databricks/python/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py in infer_or_load_unischema(dataset)
    388     try:
--> 389         return get_schema(dataset)
    390     except PetastormMetadataError:

/databricks/python/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py in get_schema(dataset)
    342         raise PetastormMetadataError(
--> 343             'Could not find _common_metadata file. Use materialize_dataset(..) in'
    344             ' petastorm.etl.dataset_metadata.py to generate this file in your ETL code.'

PetastormMetadataError: Could not find _common_metadata file. Use materialize_dataset(..) in petastorm.etl.dataset_metadata.py to generate this file in your ETL code. You can generate it on an existing dataset using petastorm-generate-metadata.py{code}
 

This is because our Runtime stores the following two files at the end of the job:
{code:java}
dbfs:/tmp/petastorm/_committed_4686077819843716563  _committed_4686077819843716563  1965
dbfs:/tmp/petastorm/_started_4686077819843716563{code}
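A minimal sketch of the kind of filter the reader could apply before handing file paths to the parquet loader. The helper name {{is_parquet_data_file}} and the sample file list are hypothetical, not part of Arrow or petastorm; the point is only that names beginning with an underscore (or a dot) are conventionally metadata, not data:

```python
import os

def is_parquet_data_file(path):
    # Hypothetical helper: treat files whose basename starts with
    # '_' or '.' (e.g. _SUCCESS, _common_metadata, _committed_*,
    # _started_*) as Spark/Databricks metadata and skip them.
    name = os.path.basename(path)
    return not name.startswith(('_', '.'))

# Example listing resembling the directory described above.
files = [
    "dbfs:/tmp/petastorm/part-00000.parquet",
    "dbfs:/tmp/petastorm/_committed_4686077819843716563",
    "dbfs:/tmp/petastorm/_started_4686077819843716563",
]
data_files = [f for f in files if is_parquet_data_file(f)]
# data_files keeps only part-00000.parquet
```

Applying such a filter inside the directory-scanning path would let the reader ignore these marker files instead of failing on them.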



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
