[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

Artem KOZHEVNIKOV (Jira) Wed, 21 Aug 2019 23:43:20 -0700


    [ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912675#comment-16912675 ]


Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/22/19 6:42 AM:
-------------------------------------------------------------------

In pure Python, we could do something like the following, relying on 
`pa.Array.take` for each chunk:
{code:python}
import numpy as np
import pyarrow as pa
from pandas.core.sorting import get_group_index_sorter

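def take_on_chunked_array(charr, indices):
    # np.array copies, so the caller's indices are not modified in place
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):
        raise IndexError()

    # negative indices count from the end
    indices[indices < 0] += len(charr)

    if indices.min() < 0:
        raise IndexError()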

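    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # bins[k] is the chunk that holds indices[k]
    bins = np.searchsorted(cum_lengths, indices, side="right")
    # minlength keeps one slot per chunk, even for chunks that no index hits
    limits_idx = np.concatenate([[0], np.bincount(bins, minlength=len(lengths)).cumsum()])

    # stable sort that groups the indices by chunk
    sort_idx = get_group_index_sorter(bins, len(cum_lengths))
    del bins

    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx, kind="merge")  # inverse sort indices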

    cum_lengths -= lengths  # now the start offset of each chunk
    res_array = pa.concat_arrays([
        charr.chunks[i].take(pa.array(indices[limits_idx[i]:limits_idx[i + 1]] - cum_length))
        for i, cum_length in enumerate(cum_lengths)
    ])
    # restore the order in which the indices were requested
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]), pa.array([5, 6, 7, 8])])
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()

{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` method 
and concat_arrays, or do we want to avoid the extra copy?
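
For reference, the index-resolution step on its own (mapping each global index to a chunk number and an offset inside that chunk) can be sketched with the standard library only. This is an illustration, not a proposed implementation: plain Python lists stand in for Arrow chunks, and the names `resolve_indices` and `take_from_chunks` are made up for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def resolve_indices(chunk_lengths, indices):
    """Map global positions to (chunk_number, offset_in_chunk) pairs."""
    total = sum(chunk_lengths)
    ends = list(accumulate(chunk_lengths))  # exclusive end offset of each chunk
    resolved = []
    for idx in indices:
        pos = idx + total if idx < 0 else idx  # negative indices count from the end
        if not 0 <= pos < total:
            raise IndexError(f"index {idx} out of bounds for length {total}")
        chunk = bisect_right(ends, pos)  # first chunk whose end is > pos
        start = ends[chunk] - chunk_lengths[chunk]
        resolved.append((chunk, pos - start))
    return resolved


def take_from_chunks(chunks, indices):
    """Element-by-element reference take over a list of plain lists."""
    lengths = [len(c) for c in chunks]
    return [chunks[c][o] for c, o in resolve_indices(lengths, indices)]


chunks = [[0, 1], [2, 3, 4], [5, 6, 7, 8]]
print(resolve_indices([2, 3, 4], [6, 0, 3]))  # [(2, 1), (0, 0), (1, 1)]
print(take_from_chunks(chunks, [6, 0, 3]))    # [6, 0, 3]
```

Resolving indices one at a time like this needs neither the sort by chunk nor the final reordering take, at the cost of a per-element lookup. Whether that trade-off is acceptable in C++ is part of the question above.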


> [C++] Implement Take on ChunkedArray for DataFrame use
> ------------------------------------------------------
>
>                 Key: ARROW-5454
>                 URL: https://issues.apache.org/jira/browse/ARROW-5454
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Follow up to ARROW-2667



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
