Victor Uriarte created ARROW-1992:
-------------------------------------

             Summary: to_pandas crashes when using strings_to_categoricals on empty string cols on 0.8.0
                 Key: ARROW-1992
                 URL: https://issues.apache.org/jira/browse/ARROW-1992
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.8.0
         Environment: OS: Windows
                      Python: PY36 x64
                      Pandas: 0.22.0
                      pyarrow: 0.8.0
            Reporter: Victor Uriarte
             Fix For: 0.7.1

Python crashes when pyarrow reads back and converts a table that has a column of zero-length strings and `strings_to_categorical=True` is passed to `to_pandas`. Example code below. The same test ran fine with pyarrow 0.7.1.

```python
import pathlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    'Foo': ['A', 'A', 'B', 'B', 'C'],
    'Bar': ['A1', 'A2', 'B2', 'D3', ''],
    'Baz': ['', '', '', '', ''],  # column containing only empty strings
})

test_dir = pathlib.Path(__file__).parent / 'test_bug'
test_dir.mkdir(parents=True, exist_ok=True)

table = pa.Table.from_pandas(df)

# Write the same table twice so the directory holds a two-file dataset.
path = test_dir / 'file1.parquet'
path = str(path)  # write_table doesn't support `pathlib.Path` objects
pq.write_table(table, path)

path = test_dir / 'file2.parquet'
path = str(path)  # write_table doesn't support `pathlib.Path` objects
pq.write_table(table, path)

# Read the whole directory back as a single dataset.
path = str(test_dir)  # pass the directory as a str as well
data_set = pq.ParquetDataset(path)
table = data_set.read()

# The crash happens during this conversion.
df2 = table.to_pandas(strings_to_categorical=True)
print(len(df2))
```

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)