westonpace commented on PR #33846:
URL: https://github.com/apache/arrow/pull/33846#issuecomment-1402436528
> I assume there are many cases where using the compute facilities is much easier than creating an exec plan
The exec plan does not support window functions today, which is what we would need to implement this rank function easily. That being said, if it did support window functions, I would argue that it would be easier than using the compute facilities directly. For example, here is `Sum` support for chunked arrays built on an exec plan:
```
std::shared_ptr<Table> ChunkedArrayToTable(std::shared_ptr<ChunkedArray> chunked_array) {
  // Wrap the chunked array in a single-column table so it can feed a table_source node
  auto table_schema = schema({field("values", chunked_array->type())});
  int64_t length = chunked_array->length();
  return Table::Make(std::move(table_schema), {std::move(chunked_array)}, length);
}

Result<std::shared_ptr<ChunkedArray>> Sum(std::shared_ptr<ChunkedArray> chunked_array,
                                          const ScalarAggregateOptions& options) {
  auto table = ChunkedArrayToTable(std::move(chunked_array));
  // table_source -> aggregate, applying the "sum" kernel to column 0 ("values")
  Declaration plan = Declaration::Sequence(
      {{"table_source", TableSourceNodeOptions(std::move(table))},
       {"aggregate",
        AggregateNodeOptions({{"sum", std::make_shared<ScalarAggregateOptions>(options),
                               {0}, "values_sum"}},
                             {})}});
  ARROW_ASSIGN_OR_RAISE(auto sum_table, DeclarationToTable(std::move(plan)));
  return sum_table->column(0);
}

TEST(RankTest, Test) {
  auto test_array = ChunkedArrayFromJSON(
      int32(), {R"([4, 5, null])", R"([9])", R"([7, 6, 10, 8])", R"([11, 11])",
                R"([14])", R"([14])", R"([16])"});
  ScalarAggregateOptions options;
  ASSERT_OK_AND_ASSIGN(auto result, Sum(std::move(test_array), options));
  auto expected = ChunkedArrayFromJSON(int64(), {R"([115])"});
  AssertChunkedEqual(*expected, *result);
  std::cout << result->ToString() << std::endl;
}
```
> Ideally the vector kernel API for chunked arrays would be based on iterative chunk processing.
This is how the aggregate and project nodes already work today, so I think we would just be reinventing that machinery for the non-window-function cases.
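For illustration, here is a rough sketch of what that chunk-at-a-time processing could look like if done by hand against the compute API (the helper name `AddOnePerChunk` and the choice of kernel are just made up for this example); the project node does roughly the same thing internally, one batch at a time:
```
// Rough sketch only: run an elementwise compute kernel over each chunk and
// reassemble the results into a new ChunkedArray.
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> AddOnePerChunk(
    const std::shared_ptr<arrow::ChunkedArray>& input) {
  arrow::ArrayVector out_chunks;
  out_chunks.reserve(input->num_chunks());
  for (const auto& chunk : input->chunks()) {
    // An elementwise kernel only needs to see one chunk at a time
    ARROW_ASSIGN_OR_RAISE(arrow::Datum result,
                          arrow::compute::Add(chunk, arrow::MakeScalar(int64_t(1))));
    out_chunks.push_back(result.make_array());
  }
  return arrow::ChunkedArray::Make(std::move(out_chunks));
}
```
The decomposition works here because an elementwise kernel never needs to look across chunk boundaries; rank is exactly the case where that breaks down, since it needs a global (sorted) view of the values.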
> With some manpower this would be a good move IMHO, and exec plans could easily build on top of that.
I don't think we could build exec plans on top of this, because exec plans assume the input may not fit in memory, and a chunked array is, by definition, fully materialized in memory.
Window functions (e.g. rank) generally require sorted input, so the "doesn't fit in memory" case matters here. However, we are getting pretty close to adding spilling support for hash-join, and it would be straightforward to extend that to order-by. At that point a larger-than-memory rank should follow naturally.
However, I don't have the manpower to dedicate to this, so as I said before,
I won't stand in the way of a more incremental approach with chunked arrays.
It just seems unfortunate.