AlenkaF commented on issue #14955:
URL: https://github.com/apache/arrow/issues/14955#issuecomment-4854677086

   This issue still exists; see what is most likely the relevant [TODO in the 
codebase](https://github.com/apache/arrow/blob/7ebe6e9a62f9f22abe0b5c79013c40649bc77a5c/cpp/src/arrow/compute/row/grouper.cc#L696).
 It can be tackled if there is a need and the will to do so, or closed as 
**Not Planned** (which is what I would do).
   
   The error happens with a multi-chunk table whose chunks carry different 
dictionaries; a single-chunk dictionary array works fine. The suggested 
workaround, unifying the dictionaries, is actually a sensible step to take:
   
   ```python
   In [1]: import pyarrow as pa 
      ...: chunk1 = pa.DictionaryArray.from_arrays(
      ...:     pa.array([0, 1, 0], type=pa.int32()),
      ...:     pa.array(["a", "b"])
      ...: )
      ...: chunk2 = pa.DictionaryArray.from_arrays(
      ...:     pa.array([0, 1, 2], type=pa.int32()),
      ...:     pa.array(["b", "c", "d"])  # different dictionary!
      ...: )
      ...: 
      ...: tbl = pa.table({
      ...:     "k": pa.chunked_array([chunk1, chunk2]),
      ...:     "v": pa.chunked_array([[1, 2, 3], [4, 5, 6]])
      ...: })
   In [2]: tbl
   Out[2]: 
   pyarrow.Table
   k: dictionary<values=string, indices=int32, ordered=0>
   v: int64
   ----
   k: [  -- dictionary:
   ["a","b"]  -- indices:
   [0,1,0],  -- dictionary:
   ["b","c","d"]  -- indices:
   [0,1,2]]
   v: [[1,2,3],[4,5,6]]
   
   In [3]: tbl.group_by("k").aggregate([("v", "sum")])
   ...
   ArrowNotImplementedError: Unifying differing dictionaries
   ...
   
   In [4]: tbl.unify_dictionaries()
   Out[4]: 
   pyarrow.Table
   k: dictionary<values=string, indices=int32, ordered=0>
   v: int64
   ----
   k: [  -- dictionary:
   ["a","b","c","d"]  -- indices:
   [0,1,0],  -- dictionary:
   ["a","b","c","d"]  -- indices:
   [1,2,3]]
   v: [[1,2,3],[4,5,6]]
   
   In [5]: tbl.unify_dictionaries().group_by("k").aggregate([("v", "sum")])
   Out[5]: 
   pyarrow.Table
   k: dictionary<values=string, indices=int32, ordered=0>
   v_sum: int64
   ----
   k: [  -- dictionary:
   ["a","b","c","d"]  -- indices:
   [1,2,3,0]]
   v_sum: [[6,5,6,4]]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
