Artem KOZHEVNIKOV created ARROW-5713:
----------------------------------------

Summary: fancy indexing on pa.array
Key: ARROW-5713
URL: https://issues.apache.org/jira/browse/ARROW-5713
Project: Apache Arrow
Issue Type: New Feature
Components: C++, Python
Reporter: Artem KOZHEVNIKOV

In numpy one can do:

{code:java}
In [2]: import numpy as np
In [3]: a = np.array(['a', 'bb', 'ccc', 'dddd'], dtype="O")
In [4]: indices = np.array([0, -1, 2, 2, 0, 3])
In [5]: a[indices]
Out[5]: array(['a', 'dddd', 'ccc', 'ccc', 'a', 'dddd'], dtype=object)
{code}

It would be nice to have a similar feature in pyarrow. Currently, pa.Array.__getitem__ accepts only a slice or a single integer index as its argument. There are workarounds built on those, such as the following:

{code:java}
In [6]: import pyarrow as pa
In [7]: a = pa.array(['a', 'bb', 'ccc', 'dddd'])
In [8]: pa.array(a.to_pandas()[indices])  # when len(indices) is large
Out[8]:
<pyarrow.lib.StringArray object at 0x91bd845e8>
[
  "a",
  "dddd",
  "ccc",
  "ccc",
  "a",
  "dddd"
]
In [9]: pa.array([a[i].as_py() for i in indices])  # when len(indices) is small
Out[9]:
<pyarrow.lib.StringArray object at 0x91bc14868>
[
  "a",
  "dddd",
  "ccc",
  "ccc",
  "a",
  "dddd"
]
{code}

Neither workaround is memory- or CPU-efficient: the first converts the whole array to pandas and back, and the second creates a Python object for every selected element.
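
For illustration only, here is a minimal sketch of how the requested behaviour might look from Python, assuming a pyarrow version that provides Array.take. The helper names normalize_indices and fancy_take are hypothetical and not part of pyarrow. The sketch wraps negative indices by hand, on the assumption that take itself does not accept them.

```python
def normalize_indices(indices, length):
    """Map numpy-style (possibly negative) indices onto 0..length-1.

    Hypothetical helper, not part of pyarrow.
    """
    normalized = []
    for raw in indices:
        i = int(raw)  # also accepts numpy integer scalars
        if i < -length or i >= length:
            raise IndexError(
                f"index {i} is out of bounds for array of length {length}"
            )
        # Negative indices count from the end, as in numpy.
        normalized.append(i + length if i < 0 else i)
    return normalized


def fancy_take(arr, indices):
    """Select elements of a pyarrow Array by a list of integer positions.

    Hypothetical helper; assumes the installed pyarrow exposes Array.take.
    """
    # Imported here so that normalize_indices stays dependency-free.
    import pyarrow as pa

    positions = normalize_indices(indices, len(arr))
    return arr.take(pa.array(positions, type=pa.int64()))
```

With the data from the example above, fancy_take(pa.array(['a', 'bb', 'ccc', 'dddd']), [0, -1, 2, 2, 0, 3]) should select the same elements as the numpy expression a[indices], without a round trip through pandas.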