[
https://issues.apache.org/jira/browse/ARROW-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276787#comment-16276787
]
ASF GitHub Bot commented on ARROW-1883:
---------------------------------------
jorisvandenbossche opened a new pull request #1386: ARROW-1883: [Python] Fix
handling of metadata in to_pandas when not all columns are present
URL: https://github.com/apache/arrow/pull/1386
This closes [ARROW-1883](https://issues.apache.org/jira/browse/ARROW-1883).
In `_add_any_metadata`, I replaced `col = table[i]` with:
```
idx = schema.get_field_index(raw_name)
if idx != -1:
    col = table[idx]
```
so that a column is only processed when it is actually present in the schema.
This required some extra code to obtain `raw_name` (the name under which the
column appears in the schema), because it does not always match the name in
`pandas_metadata['columns'][..]['name']`. I am not sure whether there is a
better way to get that name, or whether it would be better to filter
`pandas_metadata` earlier on instead of checking at the point where the
metadata of each column is processed.
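For discussion, the "filter earlier" alternative might look roughly like the
sketch below. It is plain Python with no pyarrow dependency;
`filter_column_metadata` and `schema_names` are hypothetical names that are not
part of pyarrow, and the sketch assumes that the metadata `name` equals the
schema field name, which (as noted above) does not always hold.

```python
def filter_column_metadata(pandas_metadata, schema_names):
    """Return a shallow copy of the metadata without entries for absent columns.

    Hypothetical helper: assumes col_meta['name'] matches the schema field name.
    """
    present = set(schema_names)
    filtered = dict(pandas_metadata)  # leave the caller's dict untouched
    filtered['columns'] = [
        col_meta for col_meta in pandas_metadata['columns']
        if col_meta['name'] in present
    ]
    return filtered


# Metadata shaped like the small example in the issue: three entries,
# including the tz-aware column 'b'.
pandas_metadata = {
    'columns': [
        {'name': 'a', 'pandas_type': 'int64', 'metadata': None},
        {'name': 'b', 'pandas_type': 'datetimetz',
         'metadata': {'timezone': 'Europe/Brussels'}},
        {'name': '__index_level_0__', 'pandas_type': 'int64', 'metadata': None},
    ],
    'index_columns': ['__index_level_0__'],
}

# Field names of a table from which column 'b' was removed.
filtered = filter_column_metadata(pandas_metadata, ['a', '__index_level_0__'])
print([col_meta['name'] for col_meta in filtered['columns']])
# ['a', '__index_level_0__']
```

With the metadata filtered like this, the later loop would never see an entry
for a column that is missing from the table, at the cost of having to resolve
the name mismatch up front.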
> [Python] BUG: Table.to_pandas metadata checking fails if columns are not
> present
> --------------------------------------------------------------------------------
>
> Key: ARROW-1883
> URL: https://issues.apache.org/jira/browse/ARROW-1883
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Joris Van den Bossche
> Labels: pull-request-available
>
> Found this bug in the example in the pandas documentation
> (http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet), which does:
> {code}
> import numpy as np
> import pandas as pd
>
> df = pd.DataFrame({'a': list('abc'),
>                    'b': list(range(1, 4)),
>                    'c': np.arange(3, 6).astype('u1'),
>                    'd': np.arange(4.0, 7.0, dtype='float64'),
>                    'e': [True, False, True],
>                    'f': pd.date_range('20130101', periods=3),
>                    'g': pd.date_range('20130101', periods=3,
>                                       tz='US/Eastern')})
> df.to_parquet('example_pa.parquet', engine='pyarrow')
> # reading only a subset of the columns triggers the error
> pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
> {code}
> The last line, which reads only a subset of the columns, raises:
> {code}
> ...
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
>     357     for i, col_meta in enumerate(pandas_metadata['columns']):
>     358         if col_meta['pandas_type'] == 'datetimetz':
> --> 359             col = table[i]
>     360             converted = col.to_pandas()
>     361             tz = col_meta['metadata']['timezone']
> table.pxi in pyarrow.lib.Table.__getitem__()
> table.pxi in pyarrow.lib.Table.column()
> IndexError: Table column index 6 is out of range
> {code}
> This happens because `_add_any_metadata` iterates over the `pandas_metadata`
> entries of all columns (here it tries to handle a datetime tz column), while
> not all of those columns are present in the table: the pandas metadata and
> the actual schema no longer match.
> A smaller example without parquet:
> {code}
> In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')})
> In [39]: table = pyarrow.Table.from_pandas(df)
> In [40]: table
> Out[40]:
> pyarrow.Table
> a: int64
> b: timestamp[ns, tz=Europe/Brussels]
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
>             b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
>             b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
>             b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
>             b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
>             b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
>             b': "0.22.0.dev0+277.gd61f411"}'}
> In [41]: table.to_pandas()
> Out[41]:
>    a                         b
> 0  1 2017-01-01 00:00:00+01:00
> 1  2 2017-01-02 00:00:00+01:00
> 2  3 2017-01-03 00:00:00+01:00
> In [44]: table_without_tz = table.remove_column(1)
> In [45]: table_without_tz
> Out[45]:
> pyarrow.Table
> a: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
>             b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
>             b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
>             b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
>             b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
>             b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
>             b': "0.22.0.dev0+277.gd61f411"}'}
> In [46]: table_without_tz.to_pandas() # <------ wrong output !
> Out[46]:
>                                      a
> 1970-01-01 01:00:00+01:00            1
> 1970-01-01 01:00:00.000000001+01:00  2
> 1970-01-01 01:00:00.000000002+01:00  3
> In [47]: table_without_tz2 = table_without_tz.remove_column(1)
> In [48]: table_without_tz2
> Out[48]:
> pyarrow.Table
> a: int64
> metadata
> --------
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
>             b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
>             b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
>             b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
>             b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
>             b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
>             b': "0.22.0.dev0+277.gd61f411"}'}
> In [49]: table_without_tz2.to_pandas() # <------ error !
> ---------------------------------------------------------------------------
> IndexError Traceback (most recent call last)
> <ipython-input-49-c82f33476c6b> in <module>()
> ----> 1 table_without_tz2.to_pandas()
> table.pxi in pyarrow.lib.Table.to_pandas()
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
>     289     pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
>     290     index_columns = pandas_metadata['index_columns']
> --> 291     table = _add_any_metadata(table, pandas_metadata)
>     292
>     293     block_table = table
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
>     357     for i, col_meta in enumerate(pandas_metadata['columns']):
>     358         if col_meta['pandas_type'] == 'datetimetz':
> --> 359             col = table[i]
>     360             converted = col.to_pandas()
>     361             tz = col_meta['metadata']['timezone']
> table.pxi in pyarrow.lib.Table.__getitem__()
> table.pxi in pyarrow.lib.Table.column()
> IndexError: Table column index 1 is out of range
> {code}
> The reason is that `_add_any_metadata` does not check if the column it is
> processing (currently only datetime tz columns need such processing) is
> actually present in the schema.
> Working on a fix, will submit a PR.
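The two failure modes in the small example (wrong output in `Out[46]`, then an
`IndexError`) can be reproduced without pyarrow by treating the table as a plain
list of field names. This is only an illustrative sketch: `positional_lookup`
and `name_lookup` are hypothetical helpers, not pyarrow functions, and the
name-based variant assumes the metadata `name` equals the schema field name.

```python
def positional_lookup(schema_names, columns_meta):
    """Mimic the current loop: the metadata position is used as the column index."""
    hits = []
    for i, col_meta in enumerate(columns_meta):
        if col_meta['pandas_type'] == 'datetimetz':
            # Picks whatever sits at position i, or raises IndexError.
            hits.append(schema_names[i])
    return hits


def name_lookup(schema_names, columns_meta):
    """Only select tz columns that are actually present, matched by name."""
    return [
        col_meta['name'] for col_meta in columns_meta
        if col_meta['pandas_type'] == 'datetimetz'
        and col_meta['name'] in schema_names
    ]


# Metadata entries as stored for the original three-column table.
columns_meta = [
    {'name': 'a', 'pandas_type': 'int64'},
    {'name': 'b', 'pandas_type': 'datetimetz'},
    {'name': '__index_level_0__', 'pandas_type': 'int64'},
]

# After removing 'b', position 1 holds the index column instead.
print(positional_lookup(['a', '__index_level_0__'], columns_meta))
# ['__index_level_0__']

# After removing the index column as well, position 1 no longer exists.
try:
    positional_lookup(['a'], columns_meta)
except IndexError as exc:
    print('IndexError:', exc)

# Matching by name skips the missing column in both cases.
print(name_lookup(['a', '__index_level_0__'], columns_meta))  # []
print(name_lookup(['a'], columns_meta))                       # []
```

In the first case the integer index column is selected in place of `b`, which is
presumably why `Out[46]` shows the index values 0, 1, 2 rendered as timestamps
in the Europe/Brussels time zone.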