[
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570217#comment-17570217
]
Weston Pace edited comment on ARROW-17183 at 7/23/22 12:23 AM:
---------------------------------------------------------------
{quote}
Specifically about this case, we assume that this Fetch and Sort operations are
the most external relations in a Substrait plan. Meaning the Sort or Fetch
operation is called at the end of the other operations. This is not a very
accurate representation. First we need to understand if this is the general
case. cc Weston Pace Jeroen van Straten
{quote}
This is the case in TPC-H. It also tends to be the case in real-world queries
as it is rather tricky / impossible to write a mid-plan sort in SQL.
Subqueries / window functions are the most likely place where you would see a
mid-plan sort and we don't support either at the moment.
{quote}
Another approach is that we define a sink node which can execute a function
which does the expected operation. In some of the defined Sink nodes (KSelect,
Sort) there is a function called `DoFinish`. We should be able to call a custom
function within this call. So from Substrait end when we extract the plan, then
we can write the required `std::function` which would be an option for this
custom sink node. And we assume that a table as input and write the logic. This
way we don't have to introduce new nodes. And what if there are different
capabilities users need and ACERO has a limitation, can we always keep adding
nodes to fullfil that? I am not so sure. This is just a high level thought
process.
{quote}
This special sink node would have to collect all of the data in memory first.
{quote}
Although I have implemented a SortAndFetch node which can perform the fetch
followed by a sort just by following what is being done in Sort and SelectK
nodes. But I am not exactly sure any of these approaches are optimized or the
best way to solve the problem.
{quote}
The biggest general problem here is that a top-k node should not have to
collect all data in memory. It does have to scan all data but it should be
able to throw away data that is obviously larger than K. A SortAndFetch node
should also not have to collect all data in memory. Our current implementation
does. So what you've described is no worse than our current situation. Yet it
is definitely something we should improve at some point. There is ARROW-14202
to improve the top-k node. We could improve SortAndFetch at that time as well.
CC [~ArianaVillegas] as this might be something she wants to consider when
addressing ARROW-14202 (i.e. we don't just need top-k we need top-k-skipping-m)
was (Author: westonpace):
{blockquote}
Specifically about this case, we assume that this Fetch and Sort operations are
the most external relations in a Substrait plan. Meaning the Sort or Fetch
operation is called at the end of the other operations. This is not a very
accurate representation. First we need to understand if this is the general
case. cc Weston Pace Jeroen van Straten
{blockquote}
This is the case in TPC-H. It also tends to be the case in real-world queries
as it is rather tricky / impossible to write a mid-plan sort in SQL.
Subqueries / window functions are the most likely place where you would see a
mid-plan sort and we don't support either at the moment.
{blockquote}
Another approach is that we define a sink node which can execute a function
which does the expected operation. In some of the defined Sink nodes (KSelect,
Sort) there is a function called `DoFinish`. We should be able to call a custom
function within this call. So from Substrait end when we extract the plan, then
we can write the required `std::function` which would be an option for this
custom sink node. And we assume that a table as input and write the logic. This
way we don't have to introduce new nodes. And what if there are different
capabilities users need and ACERO has a limitation, can we always keep adding
nodes to fullfil that? I am not so sure. This is just a high level thought
process.
{blockquote}
This special sink node would have to collect all of the data in memory first.
{blockquote}
Although I have implemented a SortAndFetch node which can perform the fetch
followed by a sort just by following what is being done in Sort and SelectK
nodes. But I am not exactly sure any of these approaches are optimized or the
best way to solve the problem.
{blockquote}
The biggest general problem here is that a top-k node should not have to
collect all data in memory. It does have to scan all data but it should be
able to throw away data that is obviously larger than K. A SortAndFetch node
should also not have to collect all data in memory. Our current implementation
does. So what you've described is no worse than our current situation. Yet it
is definitely something we should improve at some point. There is ARROW-14202
to improve the top-k node. We could improve SortAndFetch at that time as well.
CC [~ArianaVillegas] as this might be something she wants to consider when
addressing ARROW-14202 (i.e. we don't just need top-k we need top-k-skipping-m)
> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset.
> For instance pick 10 records skipping the first 5 elements. Here we can
> define this as a Slice operation and records can be easily extracted in a
> sink-node.
> Sort and Fetch operation applies when we need to execute a Fetch operation on
> sorted data. The main issue is we cannot have a sort node followed by a
> fetch. The reason is that all existing node definitions supporting sort are
> based on sink nodes. Since there cannot be a node followed by sink, this
> functionality has to take place in a single node.
> But this is not a perfect solution for fetch and sort, but one way to do this
> is define a sink node where the records are sorted and then a set of items
> are fetched.
> Another dilema is what if sort is followed by a fetch. In that case, there
> has to be a flag to enable the order of the operations.
> The objective of this ticket is to discuss a viable efficient solution and
> include new nodes or a method to execute such a logic.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)