[ 
https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983548#comment-16983548
 ] 

Joris Van den Bossche commented on ARROW-7266:
----------------------------------------------

[~adamhooper] Thanks for the report!

This seems to be specific to the string type, as I don't see a similar bug for 
the integer type:

{code}
In [7]: a = pa.array(['a', 'b', 'c', 'b'])

In [9]: a[1:].dictionary_encode()
Out[9]: 
<pyarrow.lib.DictionaryArray object at 0x7f677975e128>

-- dictionary:
  [
    "c",
    "b",
    ""
  ]
-- indices:
  [
    0,
    1,
    2
  ]

In [10]: a = pa.array([1, 2, 3, 2])

In [12]: a[1:].dictionary_encode()
Out[12]: 
<pyarrow.lib.DictionaryArray object at 0x7f6776f5f208>

-- dictionary:
  [
    2,
    3
  ]
-- indices:
  [
    0,
    1,
    0
  ]

{code}


>  Is there a way to copy the slice to a new offset-0 StringArray that I could 
> then dictionary-encode? 

At least in the current pyarrow API, I don't think such functionality is 
exposed (apart from getting the buffers, slicing/copying them, and recreating 
an array from the result).
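
The buffer-level route mentioned above could look roughly like the following. This is only a sketch under stated assumptions, not a tested pyarrow workaround: it works on plain bytes laid out like Arrow's string format (little-endian int32 offsets plus a UTF-8 data buffer), assumes the array has no nulls so the validity bitmap can be ignored, and the helper name `rebase_string_buffers` is made up for illustration.

```python
import struct


def rebase_string_buffers(offsets_buf, data_buf, offset, length):
    """Return (new_offsets, new_data) for a zero-offset copy of a slice.

    offsets_buf: little-endian int32 offsets, as bytes (Arrow string layout)
    data_buf:    UTF-8 value bytes
    offset:      logical offset of the slice into the parent array
    length:      number of values in the slice

    Assumption: no nulls, so the validity bitmap is not handled here.
    """
    # A slice of `length` values uses `length + 1` consecutive offsets.
    raw = struct.unpack_from(f"<{length + 1}i", offsets_buf, offset * 4)
    start, end = raw[0], raw[-1]
    # Shift the offsets so that the first value starts at byte 0.
    new_offsets = struct.pack(f"<{length + 1}i", *(o - start for o in raw))
    # Copy only the bytes that the slice actually references.
    new_data = bytes(data_buf[start:end])
    return new_offsets, new_data


# Buffers for ["a", "b", "c", "b"], then the equivalent of a[1:]
offsets = struct.pack("<5i", 0, 1, 2, 3, 4)
data = b"abcb"
new_offsets, new_data = rebase_string_buffers(offsets, data, offset=1, length=3)
```

In pyarrow the inputs would presumably come from the slice's `buffers()` and `offset`, and the rebased buffers could be wrapped with `pa.py_buffer` and passed to `pa.StringArray.from_buffers`; whether that avoids the bug in 0.15.1 has not been checked here.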

> [Python] dictionary_encode() of a slice gives wrong result
> ----------------------------------------------------------
>
>                 Key: ARROW-7266
>                 URL: https://issues.apache.org/jira/browse/ARROW-7266
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1
>         Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
>            Reporter: Adam Hooper
>            Priority: Major
>
> Steps to reproduce:
> {code:python}
> import pyarrow as pa
> arr = pa.array(["a", "b", "b", "b"])[1:]
> arr.dictionary_encode()
> {code}
> Expected results:
> {code}
> -- dictionary:
>   [
>     "b"
>   ]
> -- indices:
>   [
>     0,
>     0,
>     0
>   ]
> {code}
> Actual results:
> {code}
> -- dictionary:
>   [
>     "b",
>     ""
>   ]
> -- indices:
>   [
>     0,
>     0,
>     1
>   ]
> {code}
> I don't know a workaround. Converting to pylist and back is too slow. Is 
> there a way to copy the slice to a new offset-0 StringArray that I could then 
> dictionary-encode? Otherwise, I'm considering building buffers by hand....



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
