[ https://issues.apache.org/jira/browse/ARROW-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-1883:
-----------------------------------------
Description:
Found this bug in the example in the pandas documentation (), which does:
{{import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('abc'),
                   'b': list(range(1, 4)),
                   'c': np.arange(3, 6).astype('u1'),
                   'd': np.arange(4.0, 7.0, dtype='float64'),
                   'e': [True, False, True],
                   'f': pd.date_range('20130101', periods=3),
                   'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
# Write all seven columns, then read back only two of them.
df.to_parquet('example_pa.parquet', engine='pyarrow')
pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])}}
and the last line, which reads only a subset of the columns, raises:
{{...
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
in _add_any_metadata(table, pandas_metadata)
357 for i, col_meta in enumerate(pandas_metadata['columns']):
358 if col_meta['pandas_type'] == 'datetimetz':
--> 359 col = table[i]
360 converted = col.to_pandas()
361 tz = col_meta['metadata']['timezone']
table.pxi in pyarrow.lib.Table.__getitem__()
table.pxi in pyarrow.lib.Table.column()
IndexError: Table column index 6 is out of range}}
This happens because the code walks over every column listed in
`pandas_metadata` (here it tries to handle a datetime tz column), while not
all of those columns are present in the table: there is a mismatch between
the pandas metadata and the actual schema.
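The mismatch can be illustrated without pyarrow. The following is only a minimal sketch in plain Python, assuming the pandas metadata is a dict with a `columns` list like the one printed further down in this report. The helper `stale_metadata_entries`, the trimmed metadata dict and the assumption that the table read back holds exactly the columns `a` and `b` are made up for illustration and are not part of pyarrow.

```python
# Hypothetical stand-in for the decoded b'pandas' schema metadata of the
# seven-column frame; only the fields used here are included, and the
# pandas_type values are illustrative.
pandas_metadata = {
    "columns": [
        {"name": "a", "pandas_type": "unicode"},
        {"name": "b", "pandas_type": "int64"},
        {"name": "c", "pandas_type": "uint8"},
        {"name": "d", "pandas_type": "float64"},
        {"name": "e", "pandas_type": "bool"},
        {"name": "f", "pandas_type": "datetime"},
        {"name": "g", "pandas_type": "datetimetz"},
    ]
}

# Assumption: columns present after reading with columns=['a', 'b'].
schema_names = ["a", "b"]


def stale_metadata_entries(pandas_metadata, schema_names):
    """Return (metadata_position, name) for entries missing from the schema."""
    present = set(schema_names)
    return [
        (i, col_meta["name"])
        for i, col_meta in enumerate(pandas_metadata["columns"])
        if col_meta["name"] not in present
    ]


stale = stale_metadata_entries(pandas_metadata, schema_names)
print(stale)
```

Under these assumptions the datetimetz entry `g` sits at metadata position 6, which matches the index reported in the `IndexError` above.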
A smaller example without parquet:
{{In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01",
periods=3, tz='Europe/Brussels')})
In [39]: table = pyarrow.Table.from_pandas(df)
In [40]: table
Out[40]:
pyarrow.Table
a: int64
b: timestamp[ns, tz=Europe/Brussels]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}
In [41]: table.to_pandas()
Out[41]:
   a                         b
0  1 2017-01-01 00:00:00+01:00
1  2 2017-01-02 00:00:00+01:00
2  3 2017-01-03 00:00:00+01:00
In [44]: table_without_tz = table.remove_column(1)
In [45]: table_without_tz
Out[45]:
pyarrow.Table
a: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}
In [46]: table_without_tz.to_pandas() # <------ wrong output !
Out[46]:
                                     a
1970-01-01 01:00:00+01:00            1
1970-01-01 01:00:00.000000001+01:00  2
1970-01-01 01:00:00.000000002+01:00  3
In [47]: table_without_tz2 = table_without_tz.remove_column(1)
In [48]: table_without_tz2
Out[48]:
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}
In [49]: table_without_tz2.to_pandas() # <------ error !
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-49-c82f33476c6b> in <module>()
----> 1 table_without_tz2.to_pandas()
table.pxi in pyarrow.lib.Table.to_pandas()
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
in table_to_blockmanager(options, table, memory_pool, nthreads)
289 pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
290 index_columns = pandas_metadata['index_columns']
--> 291 table = _add_any_metadata(table, pandas_metadata)
292
293 block_table = table
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
in _add_any_metadata(table, pandas_metadata)
357 for i, col_meta in enumerate(pandas_metadata['columns']):
358 if col_meta['pandas_type'] == 'datetimetz':
--> 359 col = table[i]
360 converted = col.to_pandas()
361 tz = col_meta['metadata']['timezone']
table.pxi in pyarrow.lib.Table.__getitem__()
table.pxi in pyarrow.lib.Table.column()
IndexError: Table column index 1 is out of range}}
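The reason is that `_add_any_metadata` uses the position of each entry in the
pandas metadata as the column index into the table, and does not check whether
the column it is processing (currently only datetime tz columns need such
processing) is actually present in the schema.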
Working on a fix, will submit a PR.
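One possible shape for such a check, as a sketch only and not the actual patch, is to look each metadata entry up by name in the schema and skip entries that are absent. `present_tz_columns` is a hypothetical helper, and the schemas are modelled as plain lists of column names so the example runs without pyarrow.

```python
def present_tz_columns(pandas_metadata, schema_names):
    """Return (schema_index, timezone) for tz-aware columns found in the schema.

    Entries are matched by name, so metadata that lists columns no longer in
    the table is skipped instead of being used as a positional index.
    """
    positions = {name: i for i, name in enumerate(schema_names)}
    result = []
    for col_meta in pandas_metadata["columns"]:
        if col_meta["pandas_type"] != "datetimetz":
            continue
        idx = positions.get(col_meta["name"])
        if idx is None:
            continue  # column was dropped; nothing to convert
        result.append((idx, col_meta["metadata"]["timezone"]))
    return result


# Metadata as in the smaller example (fields trimmed to those used here).
metadata = {
    "columns": [
        {"name": "a", "pandas_type": "int64", "metadata": None},
        {"name": "b", "pandas_type": "datetimetz",
         "metadata": {"timezone": "Europe/Brussels"}},
        {"name": "__index_level_0__", "pandas_type": "int64", "metadata": None},
    ]
}

full = present_tz_columns(metadata, ["a", "b", "__index_level_0__"])
without_tz = present_tz_columns(metadata, ["a", "__index_level_0__"])
without_tz2 = present_tz_columns(metadata, ["a"])
print(full, without_tz, without_tz2)
```

With name-based lookup, neither of the two reduced tables yields a column to convert. This also seems consistent with the wrong output in Out[46]: a positional lookup at index 1 would hit `__index_level_0__` instead of `b`, so the index values 0, 1, 2 appear to have been interpreted as nanosecond timestamps in the Europe/Brussels timezone.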
> [Python] BUG: Table.to_pandas metadata checking fails if columns are not
> present
> --------------------------------------------------------------------------------
>
> Key: ARROW-1883
> URL: https://issues.apache.org/jira/browse/ARROW-1883
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Joris Van den Bossche
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)