[ 
https://issues.apache.org/jira/browse/ARROW-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391579#comment-17391579
 ] 

David Li commented on ARROW-13509:
----------------------------------

Thanks for the report. The error is most likely because the Take kernel 
implementation omits the type when constructing the chunked array, but it could 
and should pass through the type from the input arrays. See 
[TakeCC|https://github.com/apache/arrow/blob/c51e4a179379628578a69f536ffca80a844efcd2/cpp/src/arrow/compute/kernels/vector_selection.cc#L2038].
 Additionally I can confirm this still affects 5.0.0. 

> [C++] Cannot "explode" empty table
> ----------------------------------
>
>                 Key: ARROW-13509
>                 URL: https://issues.apache.org/jira/browse/ARROW-13509
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: &res
>            Priority: Minor
>
> I'm trying to explode a table (in the pandas sense: 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html]).
> As it's not yet supported, I've written some code to do it using a mix of 
> list_flatten and list_parent_indices. It works well, except that it crashes 
> for empty tables:
> {code:python}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0730 15:16:05.164858 13612 chunked_array.cc:48]  Check failed: (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
> *** Check failure stack trace: ***
> Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
> {code}
> Here's a reproducible example:
> {code:python}
> import sys
> import pyarrow as pa
> from pyarrow import compute
> import pandas as pd
> table = pa.Table.from_arrays(
>     [
>         pa.array([101, 102, 103], pa.int32()),
>         pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
>     ],
>     names=['key', 'list']
> )
> def explode(table) -> pa.Table:
>     exploded_list = compute.list_flatten(table['list'])
>     indices = compute.list_parent_indices(table['list'])
>     assert indices.type == pa.int32()
>     keys = compute.take(table['key'], indices)  # <--- Crashes here
>     return pa.Table.from_arrays(
>         [keys, exploded_list],
>         names=['key', 'list_element']
>     )
> explode(table).to_pandas().to_markdown(sys.stdout)
> explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout)  # <--- doesn't work
> {code}
>  
> I've narrowed it down to the following: 
> when list_parent_indices is called on an empty table, it returns this 
> chunked array with no chunks:
> {code}
> pa.chunked_array([], pa.int32())
> {code}
> Instead of this chunked array with 1 empty chunk:
> {code}
> pa.chunked_array([pa.array([], pa.int32())])
> {code}
> In turn, take doesn't work with a chunked array that has no chunks:
> {code:python}
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
>              pa.chunked_array([], pa.int32())) # Bad
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
>              pa.chunked_array([pa.array([], pa.int32())])) # Good
> {code}
> Now, in terms of how to fix it, there are two solutions:
> * take could accept empty chunked array
> * list_parent_indices could return a chunked array with an empty chunk
> PS: the error message isn't accurate. It says "cannot construct ChunkedArray 
> from empty vector and omitted type", but the array being passed has a 
> type (int32) and no chunks. It makes me suspect that something in take strips 
> the type of the empty chunked array.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
