dmbates opened a new issue, #360:
URL: https://github.com/apache/arrow-julia/issues/360
When a column that contains missing values, e.g. a `PooledArray`, is converted
to `DictEncoded`, the dictionary is built from the result of `DataAPI.refpool`,
which includes `missing`. As a result, both the dictionary and the index vector
contain missing values, and pandas rejects the null dictionary entry when it
builds a categorical from the column. The missing value in the dictionary can
be dropped because it is never referenced by the index vector.
```julia
julia> using Arrow, DataAPI, PooledArrays
julia> tbl = (; a = PooledArray([missing, "a", "b", "a"]))
(a = Union{Missing, String}[missing, "a", "b", "a"],)
julia> DataAPI.refarray(tbl.a)
4-element Vector{UInt32}:
0x00000001
0x00000002
0x00000003
0x00000002
julia> DataAPI.refpool(tbl.a)
3-element Vector{Union{Missing, String}}:
missing
"a"
"b"
julia> Arrow.write("tbl.arrow", tbl)
"tbl.arrow"
```
The `read_table` result shows a `null` in the dictionary at (0-based) Python
index 0 that is never referenced by the indices vector. Converting to pandas
with `read_feather` then raises a `ValueError`.
```python
$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.feather as fea
>>> fea.read_table("tbl.arrow")
pyarrow.Table
a: dictionary<values=string, indices=int8, ordered=0>
----
a: [ -- dictionary:
[null,"a","b"] -- indices:
[null,1,2,1]]
>>> fea.read_feather('nyc_mv_collisions_202201.arrow')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/feather.py", line 231, in read_feather
    return (read_table(
  File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 818, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 757, in _reconstruct_block
    cat = _pandas_api.categorical_type.from_codes(
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 687, in from_codes
    dtype = CategoricalDtype._from_values_or_dtype(
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 299, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 186, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 340, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 534, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null
```
One possible approach is to check for `missing` in the refpool, find its
index, delete that entry from the refpool, and rewrite the refarray so that
references to that index become `missing` and references to later entries
shift down by one.
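A minimal sketch of that approach in plain Julia, to make the index bookkeeping concrete. The function name `drop_missing_from_pool` is hypothetical and not part of Arrow.jl or DataAPI; it assumes 1-based refs as returned by `DataAPI.refarray` and a pool with at most one `missing` entry.

```julia
# Hypothetical helper (not an Arrow.jl API): remove `missing` from a pool and
# remap the 1-based refs that point into it.
function drop_missing_from_pool(refs::AbstractVector{<:Integer}, pool::AbstractVector)
    m = findfirst(ismissing, pool)
    # No `missing` in the pool: nothing to remap.
    m === nothing && return (collect(Union{Missing,Int}, refs), collect(pool))
    # Assumes the pool holds unique values, so at most one entry is `missing`.
    newpool = collect(skipmissing(pool))
    newrefs = Vector{Union{Missing,Int}}(undef, length(refs))
    for (i, r) in enumerate(refs)
        if r == m
            newrefs[i] = missing      # the ref pointed at the `missing` entry
        elseif r > m
            newrefs[i] = Int(r) - 1   # entries after the deleted one shift down
        else
            newrefs[i] = Int(r)       # entries before it are unchanged
        end
    end
    return newrefs, newpool
end

# Values taken from the example above.
refs = UInt32[1, 2, 3, 2]
pool = Union{Missing,String}[missing, "a", "b"]
newrefs, newpool = drop_missing_from_pool(refs, pool)
```

The refs stay 1-based here; since the pyarrow output above shows 0-based indices, a writer would presumably still subtract one from each non-missing ref when producing the Arrow index vector.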