[ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725179#comment-16725179 ]

Karl Koster commented on ARROW-2860:
------------------------------------

I am running into the same problem. It looks like 
https://issues.apache.org/jira/browse/ARROW-2891 addresses this within a single 
write (i.e. multiple partitions written at once, where one or more have empty 
column sets). The problem I am running into is across writes (e.g. 
incrementally adding partitions), when one write contains a column that is 
entirely NaN while other partitions have values for it. Deriving the schema 
from the earliest or latest partition would likely solve this, similar to what 
the read path does when reading multiple partitions (it takes the earliest 
schema).
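In the meantime, a possible workaround (a sketch only, with illustrative column names, not something from this ticket) is to pin one explicit schema up front and cast each incremental batch to it before handing it to {{pq.write_to_dataset}}, so an all-null column still lands on disk with the intended type:

{code:python}
import pyarrow as pa

# Sketch of a workaround: fix the schema once, then cast every
# incremental batch to it before writing. A batch whose column is
# entirely None/NaN would otherwise be inferred as Arrow's "null" type
# and conflict with partitions written earlier.
schema = pa.schema([
    ('foo', pa.float64()),
    ('bar', pa.string()),
])

# All-None columns: type inference alone would produce null-typed
# columns here, but the cast forces the pinned types.
batch = pa.table({'foo': [None, None], 'bar': [None, None]}).cast(schema)
{code}

After the cast, {{batch.schema}} matches the pinned schema regardless of how sparse the batch is, so every partition written from it agrees with the others.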

> [Python] Null values in a single partition of Parquet dataset, results in 
> invalid schema on read
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2860
>                 URL: https://issues.apache.org/jira/browse/ARROW-2860
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Sam Oluwalana
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.12.0
>
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events', 
> partition_cols=['event_type', 'event_date'])
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in <module>
>     dataset = pq.ParquetDataset('/tmp/events')
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 745, in __init__
>     self.validate_schemas()
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 775, in validate_schemas
>     dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] 
> /tmp/events/event_type=3/event_date=2018-07-16 
> 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> vs
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> What is happening is that pyarrow infers the schema of each partition 
> individually: the partitions under `event_type=3/event_date=*` both have 
> values for the column `different`, whereas the other partitions do not. The 
> discrepancy causes the all-`None` column in the other partitions to be 
> labeled with `pandas_type` `empty` (Arrow type `null`) instead of `unicode` 
> (Arrow type `string`).
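
The inference gap described above can be seen directly with two tiny arrays (an illustrative sketch, not part of the original report):

{code:python}
import pyarrow as pa

# A column containing only None is inferred as Arrow's "null" type,
# while the same column with at least one string value is inferred as
# "string" -- exactly the per-partition schema mismatch reported here.
all_none = pa.array([None, None])
mixed = pa.array([None, u'data'])

print(all_none.type)  # null
print(mixed.type)     # string
{code}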



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
