westonpace commented on PR #33846:
URL: https://github.com/apache/arrow/pull/33846#issuecomment-1402436528
> I assume there are many cases where using the compute facilities is much easier than creating an exec plan
The exec plan does not support window functions today, which is what we would need to implement this rank function easily. That being said, if it did support window functions, I would argue that it would be easier than using the compute facilities directly. For example, here is `Sum` support for chunked arrays built on an exec plan:
```
std::shared_ptr<Table> ChunkedArrayToTable(std::shared_ptr<ChunkedArray> chunked_array) {
  // Wrap the chunked array in a single-column table so it can feed a table_source node
  auto table_schema = schema({field("values", chunked_array->type())});
  int64_t length = chunked_array->length();
  return Table::Make(std::move(table_schema), {std::move(chunked_array)}, length);
}

Result<std::shared_ptr<ChunkedArray>> Sum(std::shared_ptr<ChunkedArray> chunked_array,
                                          const ScalarAggregateOptions& options) {
  auto table = ChunkedArrayToTable(std::move(chunked_array));
  // table_source -> aggregate, applying the "sum" kernel to column 0 ("values")
  Declaration plan = Declaration::Sequence(
      {{"table_source", TableSourceNodeOptions(std::move(table))},
       {"aggregate",
        AggregateNodeOptions({{"sum", std::make_shared<ScalarAggregateOptions>(options),
                               {0}, "values_sum"}},
                             {})}});
  ARROW_ASSIGN_OR_RAISE(auto sum_table, DeclarationToTable(std::move(plan)));
  return sum_table->column(0);
}

TEST(RankTest, Test) {
  auto test_array = ChunkedArrayFromJSON(
      int32(), {R"([4, 5, null])", R"([9])", R"([7, 6, 10, 8])", R"([11, 11])",
                R"([14])", R"([14])", R"([16])"});
  ScalarAggregateOptions options;
  ASSERT_OK_AND_ASSIGN(auto result, Sum(std::move(test_array), options));
  auto expected = ChunkedArrayFromJSON(int64(), {R"([115])"});
  AssertChunkedEqual(*expected, *result);
  std::cout << result->ToString() << std::endl;
}
```
> Ideally the vector kernel API for chunked arrays would be based on iterative chunk processing.
This is how the aggregate and project nodes already work today, so I think we would just be reinventing that machinery for the non-window-function cases.
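For illustration, here is a rough sketch of what that chunk-at-a-time processing could look like if done by hand against the compute API (the helper name `AddOnePerChunk` and the choice of kernel are just made up for this example); the project node does roughly the same thing internally, one batch at a time:
```
// Rough sketch only: run an elementwise compute kernel over each chunk and
// reassemble the results into a new ChunkedArray.
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> AddOnePerChunk(
    const std::shared_ptr<arrow::ChunkedArray>& input) {
  arrow::ArrayVector out_chunks;
  out_chunks.reserve(input->num_chunks());
  for (const auto& chunk : input->chunks()) {
    // An elementwise kernel only needs to see one chunk at a time
    ARROW_ASSIGN_OR_RAISE(arrow::Datum result,
                          arrow::compute::Add(chunk, arrow::MakeScalar(int64_t(1))));
    out_chunks.push_back(result.make_array());
  }
  return arrow::ChunkedArray::Make(std::move(out_chunks));
}
```
The decomposition works here because an elementwise kernel never needs to look across chunk boundaries; rank is exactly the case where that breaks down, since it needs a global (sorted) view of the values.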
> With some manpower this would be a good move IMHO, and exec plans could easily build on top of that.
I don't think we could build exec plans on top of this, because exec plans assume the input may not fit in memory, and a chunked array is, by definition, fully materialized in memory.
Window functions (e.g. rank) generally require sorted input, so the "doesn't fit in memory" case matters here. However, we are getting pretty close to adding spilling support for hash-join, and it would be straightforward to extend that to order-by. At that point a larger-than-memory rank should follow naturally.
However, I don't have the manpower to dedicate to this, so as I said before,
I won't stand in the way of a more incremental approach with chunked arrays.
It just seems unfortunate.