[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912675#comment-16912675 ]
Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 6:42 AM:
-------------------------------------------------------------------

If it were in pure Python, we could do something like the code below (relying on `pa.Array.take`):

{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter


def take_on_chunked_array(charr, indices):
    # np.array copies, so the caller's indices are left untouched
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()
    indices[indices < 0] += len(charr)  # wrap negative indices
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    # chunk number of each requested index
    bins = np.searchsorted(cum_lengths, indices, side="right")
    # minlength keeps one slot per chunk even if trailing chunks receive no index
    limits_idx = np.concatenate([[0], np.bincount(bins, minlength=len(lengths)).cumsum()])
    # stable sort of the positions by chunk number
    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="merge")  # inverse sort indices
    cum_lengths -= lengths  # now the start offset of each chunk
    res_array = pa.concat_arrays(
        [charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
         for i, cum_length in enumerate(cum_lengths)])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}

Do we want something similar in C++? Should we reuse the `cpp:Array:Take` method and concat_arrays, or do we want to avoid an extra copy?
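The core of the snippet above is the lookup that turns a global index into a chunk number plus an offset inside that chunk (the `np.searchsorted(..., side="right")` step). A minimal standard-library sketch of just that lookup, for illustration only: the names `locate` and `chunk_lengths` are hypothetical and not part of pyarrow, and this handles one index at a time rather than a whole array.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(chunk_lengths, index):
    """Map a global index to (chunk_number, offset_within_chunk)."""
    total = sum(chunk_lengths)
    if index < 0:
        index += total  # wrap a negative index, as the snippet above does
    if not 0 <= index < total:
        raise IndexError(index)
    ends = list(accumulate(chunk_lengths))  # exclusive end of each chunk
    # bisect_right plays the role of np.searchsorted(..., side="right"):
    # an index equal to a chunk's end belongs to the next chunk.
    chunk = bisect_right(ends, index)
    start = ends[chunk] - chunk_lengths[chunk]
    return chunk, index - start


chunk_lengths = [2, 3, 4]  # same layout as the chunked array in the example
print([locate(chunk_lengths, i) for i in (6, 0, 3)])
# prints [(2, 1), (0, 0), (1, 1)]
```

Using the right-hand side of the bisection is what makes chunk boundaries and empty chunks come out correctly: index 2 lands at offset 0 of the second chunk rather than one past the end of the first.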
> [C++] Implement Take on ChunkedArray for DataFrame use
> ------------------------------------------------------
>
>                 Key: ARROW-5454
>                 URL: https://issues.apache.org/jira/browse/ARROW-5454
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Follow up to ARROW-2667

--
This message was sent by Atlassian Jira
(v8.3.2#803003)