[ 
https://issues.apache.org/jira/browse/ARROW-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13509:
------------------------------------
    Labels: kernel pull-request-available  (was: pull-request-available)

> [C++] Take compute function should pass through ChunkedArray type to handle 
> empty input arrays
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13509
>                 URL: https://issues.apache.org/jira/browse/ARROW-13509
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: &res
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Minor
>              Labels: kernel, pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I'm trying to explode a table (in the pandas sense: 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html]).
> As it's not yet supported, I've written some code to do it using a mix of 
> list_flatten and list_parent_indices. It works well, except that it crashes 
> on empty tables.
> {code:python}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0730 15:16:05.164858 13612 chunked_array.cc:48]  Check failed: 
> (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and 
> omitted type
> *** Check failure stack trace: ***
> Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
> {code}
> Here's a reproducible example:
> {code:python}
> import sys
> import pyarrow as pa
> from pyarrow import compute
> import pandas as pd
> table = pa.Table.from_arrays(
>     [
>         pa.array([101, 102, 103], pa.int32()),
>         pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
>     ],
>     names=['key', 'list']
> )
> def explode(table: pa.Table) -> pa.Table:
>     exploded_list = compute.list_flatten(table['list'])
>     indices = compute.list_parent_indices(table['list'])
>     assert indices.type == pa.int32()
>     keys = compute.take(table['key'], indices)  # <--- Crashes here
>     return pa.Table.from_arrays(
>         [keys, exploded_list],
>         names=['key', 'list_element']
>     )
> explode(table).to_pandas().to_markdown(sys.stdout)
> explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout)  # <--- doesn't work
> {code}
>  
> I've narrowed it down to the following: 
> when list_parent_indices is called on an empty table, it returns this empty 
> chunked array (no chunks at all):
> {code}
> pa.chunked_array([], pa.int32())
> {code}
> Instead of this chunked array with 1 empty chunk:
> {code}
> pa.chunked_array([pa.array([], pa.int32())])
> {code}
> In turn, take doesn't work with the empty chunked array:
> {code:python}
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
>              pa.chunked_array([], pa.int32())) # Bad
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
>              pa.chunked_array([pa.array([], pa.int32())])) # Good
> {code}
> Now, in terms of how to fix it, there are two possible solutions:
> * take could accept an empty chunked array
> * list_parent_indices could return a chunked array with an empty chunk
> PS: the error message isn't accurate. It says "cannot construct ChunkedArray 
> from empty vector and omitted type", but the array being passed has a type 
> (int32) and simply no chunks. That makes me suspect that something in take 
> strips the type of the empty chunked array.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
