[GitHub] [arrow-julia] dmbates opened a new issue, #360: Creating DictEncoded in the presence of missing values

GitBox Wed, 09 Nov 2022 07:51:14 -0800


dmbates opened a new issue, #360:
URL: https://github.com/apache/arrow-julia/issues/360


   When a column such as a `PooledArray` that contains missing values is 
converted to `DictEncoded`, the dictionary is built from the result of 
`DataAPI.refpool`, which includes `missing`.  As a result, both the dictionary 
and the index vector of the written column contain missing values, which 
confuses pandas.  The missing value in the dictionary can be dropped because 
it is never referenced in the index vector.
   
   ```julia
   julia> using Arrow, DataAPI, PooledArrays
   
   julia> tbl = (; a = PooledArray([missing, "a", "b", "a"]))
   (a = Union{Missing, String}[missing, "a", "b", "a"],)
   
   julia> DataAPI.refarray(tbl.a)
   4-element Vector{UInt32}:
    0x00000001
    0x00000002
    0x00000003
    0x00000002
   
   julia> DataAPI.refpool(tbl.a)
   3-element Vector{Union{Missing, String}}:
    missing
    "a"
    "b"
   
   julia> Arrow.write("tbl.arrow", tbl)
   "tbl.arrow"
   ```
   
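   The `read_table` result shows a `null` in the dictionary at Python index 0 
that is never referenced in the indices vector.  The subsequent `read_feather` 
call, which converts the table to pandas, fails with 
`ValueError: Categorical categories cannot be null`.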
   
   ```python
   $ python
   Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow.feather as fea
   >>> fea.read_table("tbl.arrow")
   pyarrow.Table
   a: dictionary<values=string, indices=int8, ordered=0>
   ----
   a: [  -- dictionary:
   [null,"a","b"]  -- indices:
   [null,1,2,1]]
   >>> fea.read_feather('nyc_mv_collisions_202201.arrow')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/feather.py", line 231, in read_feather
       return (read_table(
     File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
     File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 818, in table_to_blockmanager
       blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in _table_to_blocks
       return [_reconstruct_block(item, columns, extension_columns)
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in <listcomp>
       return [_reconstruct_block(item, columns, extension_columns)
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 757, in _reconstruct_block
       cat = _pandas_api.categorical_type.from_codes(
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 687, in from_codes
       dtype = CategoricalDtype._from_values_or_dtype(
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 299, in _from_values_or_dtype
       dtype = CategoricalDtype(categories, ordered)
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 186, in __init__
       self._finalize(categories, ordered, fastpath=False)
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 340, in _finalize
       categories = self.validate_categories(categories, fastpath=fastpath)
     File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 534, in validate_categories
       raise ValueError("Categorical categories cannot be null")
   ValueError: Categorical categories cannot be null
   ```
   
   One possible approach is to check for `missing` in the refpool, find its 
index, delete that entry from the refpool, and rewrite the refarray so that 
references to that index become `missing` and references to later entries are 
shifted down by one.
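   
   A minimal sketch of that remapping in plain Julia, for illustration only.  
The helper name `drop_missing_from_pool` is hypothetical and is not part of 
Arrow.jl; the sketch assumes a 1-based pool with at most one `missing` entry, 
as in the `refpool` output above, and it does not touch how Arrow.jl actually 
builds `DictEncoded`.
   
   ```julia
   # Hypothetical helper (not an Arrow.jl function): remove `missing` from the
   # pool and remap the 1-based references so they index the shrunken pool.
   function drop_missing_from_pool(refs::AbstractVector{T}, pool::AbstractVector) where {T<:Integer}
       m = findfirst(ismissing, pool)
       # No `missing` in the pool: only widen the reference type.
       m === nothing && return Vector{Union{Missing,T}}(refs), collect(pool)
   
       # Assumes at most one `missing` in the pool, so skipping it is enough.
       newpool = collect(skipmissing(pool))
       newrefs = Vector{Union{Missing,T}}(undef, length(refs))
       for (i, r) in enumerate(refs)
           # A reference to the removed slot becomes `missing`; references to
           # later slots shift down by one; earlier ones are unchanged.
           newrefs[i] = r == m ? missing : (r > m ? r - one(r) : r)
       end
       return newrefs, newpool
   end
   
   # The values from the example above.
   refs = UInt32[1, 2, 3, 2]
   pool = Union{Missing,String}[missing, "a", "b"]
   newrefs, newpool = drop_missing_from_pool(refs, pool)
   
   # Decoding the remapped references reproduces the original column.
   decoded = [ismissing(r) ? missing : newpool[r] for r in newrefs]
   ```
   
   Here `newpool` no longer contains `missing`, and only `newrefs` carries the 
missing positions, which is the layout pandas accepts for categorical data.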


