[
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569935#comment-17569935
]
Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--------------------------------------------------
Specifically about this case, we assume that this Fetch and Sort operations are
the most external relations in a Substrait plan. Meaning the Sort or Fetch
operation is called at the end of the other operations. This is not a very
accurate representation. First we need to understand if this is the general
case. cc [~westonpace] [~jvanstraten]
There can be a few settings where these two operations are applied.
*Sort Only*
If the query only has a Sorting operation instead of adding the
`SinkNodeConsumer`, we need to add a `OrderBySinkNodeConsumer`.
*Fetch Only*
If the query only has a Fetch operation, we can include a node with fetch
capability. At the moment we don't have a node with Fetch capability, so this
may need to be included where we could be able to use a siimilar logic used in
`SelectK` node.
*SortAndFetch or FetchAndSort*
If the query contains both sort and fetch in a given order, there has to be a
single Sink node which can do this operation by the given order. When scanning
the plan components, when we find a sort we just add a `OrderBySink` and keep
adding other relations. If we find a Fetch operation, this needs to be replaced
with a SortAndFetch operation where sorting is done first and fetching is done
next. And this can go vice-versa.
*Another Approach:*
Another approach is that we define a sink node which can execute a function
which does the expected operation. In some of the defined Sink nodes (KSelect,
Sort) there is a function called {*}`{*}DoFinish{*}`.{*} We should be able to
call a custom function within this call. So from Substrait end when we extract
the plan, then we can write the required `std::function` which would be an
option for this custom sink node. And we assume that a table as input and write
the logic. This way we don't have to introduce new nodes. And what if there are
different capabilities users need and ACERO has a limitation, can we always
keep adding nodes to fullfil that? I am not so sure. This is just a high level
thought process.
Although I have implemented a _SortAndFetch_ node which can perform the fetch
followed by a sort just by following what is being done in Sort and SelectK
nodes. But I am not exactly sure any of these approaches are optimized or the
best way to solve the problem.
This is the doubtful component which I am not quite clear what would be the
most optimize way to include this capability. Or if there is a better way to do
this. Appreciate your thoughts.
cc [~westonpace] [~jvanstraten] [~bkietz] [~icook]
> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset.
> For instance pick 10 records skipping the first 5 elements. Here we can
> define this as a Slice operation and records can be easily extracted in a
> sink-node.
> Sort and Fetch operation applies when we need to execute a Fetch operation on
> sorted data. The main issue is we cannot have a sort node followed by a
> fetch. The reason is that all existing node definitions supporting sort are
> based on sink nodes. Since there cannot be a node followed by sink, this
> functionality has to take place in a single node.
> But this is not a perfect solution for fetch and sort, but one way to do this
> is define a sink node where the records are sorted and then a set of items
> are fetched.
> Another dilema is what if sort is followed by a fetch. In that case, there
> has to be a flag to enable the order of the operations.
> The objective of this ticket is to discuss a viable efficient solution and
> include new nodes or a method to execute such a logic.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)