[
https://issues.apache.org/jira/browse/ARROW-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ian Cook updated ARROW-13509:
-----------------------------
Affects Version/s: 5.0.0
> [C++] Take compute function should pass through ChunkedArray type to handle
> empty input arrays
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-13509
> URL: https://issues.apache.org/jira/browse/ARROW-13509
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 3.0.0, 4.0.0, 5.0.0
> Reporter: &res
> Assignee: Percy Camilo Triveño Aucahuasi
> Priority: Minor
> Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> I'm trying to explode a table (in the pandas sense:
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html]).
> As it's not yet supported, I've written some code to do it using a mix of
> list_flatten and list_parent_indices. It works well, except that it crashes
> on empty tables:
> {code:python}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0730 15:16:05.164858 13612 chunked_array.cc:48] Check failed:
> (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and
> omitted type
> *** Check failure stack trace: ***
> Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
> {code}
> Here's a reproducible example:
> {code:python}
> import sys
> import pyarrow as pa
> from pyarrow import compute
> import pandas as pd
> table = pa.Table.from_arrays(
>     [
>         pa.array([101, 102, 103], pa.int32()),
>         pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
>     ],
>     names=['key', 'list']
> )
> def explode(table) -> pd.DataFrame:
>     exploded_list = compute.list_flatten(table['list'])
>     indices = compute.list_parent_indices(table['list'])
>     assert indices.type == pa.int32()
>     keys = compute.take(table['key'], indices)  # <--- Crashes here
>     return pa.Table.from_arrays(
>         [keys, exploded_list],
>         names=['key', 'list_element']
>     )
> explode(table).to_pandas().to_markdown(sys.stdout)
> explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout)  # <--- doesn't work
> {code}
>
> I've narrowed it down to the following:
> When list_parent_indices is called on an empty table, it returns this empty
> chunked array (a chunked array with no chunks):
> {code}
> pa.chunked_array([], pa.int32())
> {code}
> Instead of this chunked array with 1 empty chunk:
> {code}
> pa.chunked_array([pa.array([], pa.int32())])
> {code}
> In turn, take doesn't work with the empty chunked array:
> {code:python}
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
> pa.chunked_array([], pa.int32())) # Bad
> compute.take(pa.chunked_array([pa.array([], pa.int32())]),
> pa.chunked_array([pa.array([], pa.int32())])) # Good
> {code}
> Now, in terms of how to fix it, there are two possible solutions:
> * take could accept an empty chunked array
> * list_parent_indices could return a chunked array with an empty chunk
> PS: the error message isn't accurate. It says "cannot construct ChunkedArray
> from empty vector and omitted type", but the array being passed has a
> type (int32), just no chunks. It makes me suspect that something in take strips
> the type of the empty chunked array.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)