[
https://issues.apache.org/jira/browse/ARROW-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-1658:
----------------------------------
Labels: pull-request-available (was: )
> [Python] Out of bounds dictionary indices causes segfault after converting to
> pandas
> ------------------------------------------------------------------------------------
>
> Key: ARROW-1658
> URL: https://issues.apache.org/jira/browse/ARROW-1658
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Minimal reproduction:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> num = 100
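> arr = pa.DictionaryArray.from_arrays(
>     np.arange(0, num),         # indices 0..99; only index 0 is in bounds
>     np.array(['a'], object),   # dictionary with a single entry
>     np.zeros(num, bool),       # mask: all False, so no index is null
>     True)                      # ordered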
> print(arr.to_pandas())
> {code}
> The Arrow codebase never validates that dictionary indices are in bounds, and
> pandas appears to trust the indices it is given without checking them. We
> should therefore add a method that validates that the non-null dictionary
> indices are in bounds (perhaps in {{CategoricalBlock::WriteIndices}}).
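
A minimal sketch of the kind of bounds check described above, written in NumPy rather than in the C++ conversion path. The helper name `check_dictionary_indices` and its signature are hypothetical and not part of pyarrow; the mask convention (True marks a null slot) is assumed to match the reproduction.

```python
import numpy as np


def check_dictionary_indices(indices, dictionary_size, mask=None):
    """Raise IndexError if a non-null index is outside [0, dictionary_size).

    Hypothetical helper for illustration only; not a pyarrow API.
    mask uses the reproduction's convention: True marks a null slot.
    """
    indices = np.asarray(indices)
    if mask is None:
        valid = np.ones(indices.shape, dtype=bool)
    else:
        valid = ~np.asarray(mask, dtype=bool)

    # Only non-null slots are checked; null slots may hold any value.
    out_of_bounds = valid & ((indices < 0) | (indices >= dictionary_size))
    if out_of_bounds.any():
        first = int(np.flatnonzero(out_of_bounds)[0])
        raise IndexError(
            "dictionary index %d at position %d is out of bounds for a "
            "dictionary of size %d" % (indices[first], first, dictionary_size))
```

Called with the arrays from the reproduction (`np.arange(0, 100)`, a dictionary of size 1, and an all-False mask), this raises at position 1 instead of letting the indices reach pandas.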
> As an aside: other analytics on categorical data may also encounter external
> data with out-of-bounds index values. We should plan for these cases and
> decide whether to raise an exception or treat such values as null.
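
The two policies mentioned above (raise, or treat as null) could look roughly like this on the pandas side. This is only a sketch: `to_categorical` and its `on_invalid` parameter are hypothetical names, and it relies on the pandas convention that a code of -1 in `Categorical.from_codes` denotes a missing value.

```python
import numpy as np
import pandas as pd


def to_categorical(indices, dictionary, on_invalid="raise"):
    """Build a pandas Categorical from dictionary-encoded data.

    Hypothetical helper; `on_invalid` is an illustrative parameter.
    "raise": reject out-of-bounds indices with an IndexError.
    "null":  replace them with code -1, which pandas treats as missing.
    """
    if on_invalid not in ("raise", "null"):
        raise ValueError("on_invalid must be 'raise' or 'null'")

    codes = np.array(indices, dtype=np.int64)  # copy; the input is not modified
    invalid = (codes < 0) | (codes >= len(dictionary))
    if invalid.any():
        if on_invalid == "raise":
            raise IndexError(
                "%d dictionary indices are out of bounds" % invalid.sum())
        codes[invalid] = -1  # pandas sentinel for a missing category code
    return pd.Categorical.from_codes(codes, categories=list(dictionary))
```

Raising is the safer default for data produced inside Arrow, where a bad index indicates a bug; mapping to null may suit external data where stray codes are expected.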