[jira] [Comment Edited] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

Weston Pace (Jira) Fri, 22 Jul 2022 17:24:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570217#comment-17570217
 ]


Weston Pace edited comment on ARROW-17183 at 7/23/22 12:23 AM:
---------------------------------------------------------------

{quote}
Specifically about this case, we assume that this Fetch and Sort operations are 
the most external relations in a Substrait plan. Meaning the Sort or Fetch 
operation is called at the end of the other operations. This is not a very 
accurate representation. First we need to understand if this is the general 
case. cc Weston Pace Jeroen van Straten 
{quote}

This is the case in TPC-H.  It also tends to be the case in real-world queries 
as it is rather tricky / impossible to write a mid-plan sort in SQL.  
Subqueries / window functions are the most likely place where you would see a 
mid-plan sort and we don't support either at the moment.

{quote}
Another approach is that we define a sink node which can execute a function 
which does the expected operation. In some of the defined Sink nodes (KSelect, 
Sort) there is a function called `DoFinish`. We should be able to call a custom 
function within this call. So from Substrait end when we extract the plan, then 
we can write the required `std::function` which would be an option for this 
custom sink node. And we assume that a table as input and write the logic. This 
way we don't have to introduce new nodes. And what if there are different 
capabilities users need and ACERO has a limitation, can we always keep adding 
nodes to fullfil that? I am not so sure. This is just a high level thought 
process.
{quote}

This special sink node would have to collect all of the data in memory first.

{quote}
Although I have implemented a SortAndFetch node which can perform the fetch 
followed by a sort just by following what is being done in Sort and SelectK 
nodes. But I am not exactly sure any of these approaches are optimized or the 
best way to solve the problem.  
{quote}

The biggest general problem here is that a top-k node should not have to 
collect all data in memory.  It does have to scan all data but it should be 
able to throw away data that is obviously larger than K.  A SortAndFetch node 
should also not have to collect all data in memory.  Our current implementation 
does.  So what you've described is no worse than our current situation. Yet it 
is definitely something we should improve at some point.  There is ARROW-14202 
to improve the top-k node.  We could improve SortAndFetch at that time as well. 
 CC [~ArianaVillegas] as this might be something she wants to consider when 
addressing ARROW-14202 (i.e. we don't just need top-k we need top-k-skipping-m)


was (Author: westonpace):
{blockquote}
Specifically about this case, we assume that this Fetch and Sort operations are 
the most external relations in a Substrait plan. Meaning the Sort or Fetch 
operation is called at the end of the other operations. This is not a very 
accurate representation. First we need to understand if this is the general 
case. cc Weston Pace Jeroen van Straten 
{blockquote}

This is the case in TPC-H.  It also tends to be the case in real-world queries 
as it is rather tricky / impossible to write a mid-plan sort in SQL.  
Subqueries / window functions are the most likely place where you would see a 
mid-plan sort and we don't support either at the moment.

{blockquote}
Another approach is that we define a sink node which can execute a function 
which does the expected operation. In some of the defined Sink nodes (KSelect, 
Sort) there is a function called `DoFinish`. We should be able to call a custom 
function within this call. So from Substrait end when we extract the plan, then 
we can write the required `std::function` which would be an option for this 
custom sink node. And we assume that a table as input and write the logic. This 
way we don't have to introduce new nodes. And what if there are different 
capabilities users need and ACERO has a limitation, can we always keep adding 
nodes to fullfil that? I am not so sure. This is just a high level thought 
process.
{blockquote}

This special sink node would have to collect all of the data in memory first.

{blockquote}
Although I have implemented a SortAndFetch node which can perform the fetch 
followed by a sort just by following what is being done in Sort and SelectK 
nodes. But I am not exactly sure any of these approaches are optimized or the 
best way to solve the problem.  
{blockquote}

The biggest general problem here is that a top-k node should not have to 
collect all data in memory.  It does have to scan all data but it should be 
able to throw away data that is obviously larger than K.  A SortAndFetch node 
should also not have to collect all data in memory.  Our current implementation 
does.  So what you've described is no worse than our current situation. Yet it 
is definitely something we should improve at some point.  There is ARROW-14202 
to improve the top-k node.  We could improve SortAndFetch at that time as well. 
 CC [~ArianaVillegas] as this might be something she wants to consider when 
addressing ARROW-14202 (i.e. we don't just need top-k we need top-k-skipping-m)

> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
>                 Key: ARROW-17183
>                 URL: https://issues.apache.org/jira/browse/ARROW-17183
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability 
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset. 
> For instance pick 10 records skipping the first 5 elements. Here we can 
> define this as a Slice operation and records can be easily extracted in a 
> sink-node. 
> Sort and Fetch operation applies when we need to execute a Fetch operation on 
> sorted data. The main issue is we cannot have a sort node followed by a 
> fetch. The reason is that all existing node definitions supporting sort are 
> based on sink nodes. Since there cannot be a node followed by sink, this 
> functionality has to take place in a single node. 
> But this is not a perfect solution for fetch and sort, but one way to do this 
> is define a sink node where the records are sorted and then a set of items 
> are fetched. 
> Another dilema is what if sort is followed by a fetch. In that case, there 
> has to be a flag to enable the order of the operations. 
> The objective of this ticket is to discuss a viable efficient solution and 
> include new nodes or a method to execute such a logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

Reply via email to