[jira] [Created] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset

Jira Thu, 07 Nov 2019 07:52:56 -0800

François Blanchard created ARROW-7087:
-----------------------------------------


             Summary: [Pyarrow] Table Metadata disappear when we write a 
partitioned dataset
                 Key: ARROW-7087
                 URL: https://issues.apache.org/jira/browse/ARROW-7087
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
            Reporter: François Blanchard
         Attachments: Capture d’écran 2019-11-07 à 16.46.37.png

There is unexpected behavior in the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table whose schema carries custom metadata, that metadata is 
replaced by pandas metadata. This happens only when *partition_cols* is set.

 

To be more explicit, here is an example: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
                             metadata={'data': 'test'})
print(table.schema.metadata)
# Metadata is set as expected:
#
# >> OrderedDict([('data', 'test')])

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
# Metadata with the key `data` is missing:
#
# >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library":
# >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns":
# >> [{"metadata": null, "field_name": "columnA", "name": "columnA",
# >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
{code}
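A possible workaround, sketched here as an illustration and not as part of pyarrow: keep a copy of {{table.schema.metadata}} before writing, and merge it back into each table read from the dataset. The helper names {{merge_schema_metadata}} and {{restore_metadata}} are hypothetical. The sketch assumes both mappings use the same key type (on Python 3, pyarrow returns metadata keys and values as bytes), and that {{Table.replace_schema_metadata}} is available in the installed pyarrow version.

```python
def merge_schema_metadata(original, loaded):
    """Return the loaded metadata with any missing original keys restored.

    Both arguments are dict-like mappings (or None), such as the value of
    `table.schema.metadata`. Keys already present in `loaded` take priority.
    """
    merged = dict(original or {})
    merged.update(loaded or {})
    return merged


def restore_metadata(load_table, original_metadata):
    """Hypothetical helper: re-attach lost keys to a table read from disk."""
    merged = merge_schema_metadata(original_metadata,
                                   load_table.schema.metadata)
    # replace_schema_metadata returns a new table; load_table is unchanged.
    return load_table.replace_schema_metadata(merged)
```

With the names from the example above, this would be called as {{restore_metadata(load_table, table.schema.metadata)}}. It only patches the table in memory; the parquet files on disk still lack the {{data}} key.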
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
