Arthur Passos created ARROW-18307:
-------------------------------------
Summary: [C++] Read list/array data from ChunkedArray with
multiple chunks
Key: ARROW-18307
URL: https://issues.apache.org/jira/browse/ARROW-18307
Project: Apache Arrow
Issue Type: Test
Components: C++
Reporter: Arthur Passos
I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table
returned contains columns with multiple chunks (column->num_chunks() > 1). The
column in question is of type Array(Int64), although the problem is not limited
to that type.
I want to convert this arrow column into an internal structure that contains a
contiguous chunk of memory for the data and a vector of offsets, very similar
to arrow's structure. The code I have so far works in two "phases":
1. Get nested arrow column data. In that case, get Int64 data out of
Array(Int64).
2. Get offsets from Array(Int64).
To achieve #1, I loop over the chunks and store the result of
arrow::ListArray::values for each chunk in a new arrow::ChunkedArray.
{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (int chunk_i = 0, num_chunks = arrow_column->num_chunks(); chunk_i < num_chunks; ++chunk_i)
    {
        auto & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}{code}
This does not work as expected, though. Even though there are multiple chunks,
arrow::ListArray::values returns the very same values array for all of them,
which ends up duplicating the data on my side.
I then looked through more examples and came across the [ColumnarTableToVector
example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121].
It looks like this example assumes there is only one chunk and ignores the
possibility of there being multiple chunks. That is probably just a detail, and
the example was likely never intended to cover multiple chunks.
I managed to get the expected output doing something like the below:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.length());
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.length());

auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();{code}
This looks too hackish, and I feel like there must be a much better way.
Hence my question: how do I properly extract the data and offsets out of such a
column? A more generic version of this: how do I extract the data out of a
ChunkedArray with multiple chunks?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)