(arrow) branch main updated: GH-40153: [Python] Avoid using np.take in Array.to_numpy() (#40295)

apitrou Thu, 29 Feb 2024 10:14:23 -0800

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git



The following commit(s) were added to refs/heads/main by this push:
     new 5c4869d453 GH-40153: [Python] Avoid using np.take in Array.to_numpy() (#40295)
5c4869d453 is described below

commit 5c4869d453717d8946549cac408998759c5084d9
Author: Antoine Pitrou <[email protected]>
AuthorDate: Thu Feb 29 19:14:13 2024 +0100

    GH-40153: [Python] Avoid using np.take in Array.to_numpy() (#40295)
    
    ### Rationale for this change
    
    `Array.to_numpy` calls `np.take` to linearize dictionary arrays. This fails on 32-bit NumPy builds because we give NumPy 64-bit indices and NumPy would like to downcast them.
    
    ### What changes are included in this PR?
    
    Avoid calling `np.take`, instead using our own dictionary decoding routine.
    
    ### Are these changes tested?
    
    Yes. A test failure is fixed on 32-bit.
    
    ### Are there any user-facing changes?
    
    No.
    * GitHub Issue: #40153
    
    Authored-by: Antoine Pitrou <[email protected]>
    Signed-off-by: Antoine Pitrou <[email protected]>
---
 python/pyarrow/array.pxi                           | 5 +----
 python/pyarrow/src/arrow/python/arrow_to_pandas.cc | 2 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index e1bf494920..7d9b65c77d 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -1573,7 +1573,7 @@ cdef class Array(_PandasConvertible):
         # decoding the dictionary will make sure nulls are correctly handled.
         # Decoding a dictionary does imply a copy by the way,
         # so it can't be done if the user requested a zero_copy.
-        c_options.decode_dictionaries = not zero_copy_only
+        c_options.decode_dictionaries = True
         c_options.zero_copy_only = zero_copy_only
         c_options.to_numpy = True
 
@@ -1585,9 +1585,6 @@ cdef class Array(_PandasConvertible):
         # always convert to numpy array without pandas dependency
         array = PyObject_to_object(out)
 
-        if isinstance(array, dict):
-            array = np.take(array['dictionary'], array['indices'])
-
         if writable and not array.flags.writeable:
             # if the conversion already needed to a copy, writeable is True
             array = array.copy()
diff --git a/python/pyarrow/src/arrow/python/arrow_to_pandas.cc b/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
index 2115cd8015..cb9cbe5b93 100644
--- a/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
+++ b/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
@@ -2515,6 +2515,8 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
                                    std::shared_ptr<ChunkedArray> arr, PyObject* py_ref,
                                    PyObject** out) {
   if (options.decode_dictionaries && arr->type()->id() == Type::DICTIONARY) {
+    // XXX we should return an error as below if options.zero_copy_only
+    // is true, but that would break compatibility with existing tests.
     const auto& dense_type =
         checked_cast<const DictionaryType&>(*arr->type()).value_type();
     RETURN_NOT_OK(DecodeDictionaries(options.pool, dense_type, &arr));
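
For readers unfamiliar with dictionary encoding, the following is a minimal pure-Python sketch of what "decoding" a dictionary array means: each index selects a value from the dictionary, and null indices stay null. It is not Arrow's C++ `DecodeDictionaries` routine; the helper name `decode_dictionary` and the use of `None` to stand for nulls are illustrative assumptions. For non-null indices this is the same gather that `np.take(dictionary, indices)` performs.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def decode_dictionary(
    dictionary: Sequence[T], indices: Sequence[Optional[int]]
) -> List[Optional[T]]:
    """Expand dictionary-encoded data into a dense list (hypothetical helper).

    `dictionary` holds the distinct values; `indices` holds one position
    into `dictionary` per element, or None for a null element.
    """
    decoded: List[Optional[T]] = []
    for idx in indices:
        if idx is None:
            # A null index produces a null value in the dense output.
            decoded.append(None)
            continue
        if not 0 <= idx < len(dictionary):
            raise IndexError(f"dictionary index {idx} out of bounds")
        decoded.append(dictionary[idx])
    return decoded


dense = decode_dictionary(["a", "b", "c"], [0, 2, 2, None, 1])
print(dense)
```

Because the gather happens element by element here, the width of the index type never matters, which is the property the commit relies on when it moves decoding out of `np.take`.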
