[jira] [Comment Edited] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

Jeroen van Straten (Jira) Mon, 25 Jul 2022 05:41:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570867#comment-17570867
 ]


Jeroen van Straten edited comment on ARROW-17183 at 7/25/22 12:40 PM:
----------------------------------------------------------------------

{quote}IMHO, what would be better is to write the Acero query, convert it to 
a Substrait plan, and then optimize this plan using a third-party optimizer. 
Maybe there could be something like substrait-optimizer in the future (I really 
don't know). This optimized plan would then be used to create the Acero plan again.
{quote}
I'm also assuming that this will exist at some point. What I'm saying is that 
it's unlikely to optimize for Acero-specific things; it will optimize Substrait 
core relations and functions using the properties associated with those by the 
spec, such as order maintenance. If an optimal Substrait plan then degrades in 
performance just by converting it back to Acero's format due to architectural 
differences between Acero and Substrait, you're still going to need to do some 
basic optimizations afterward.
{quote}Yes, if Acero expects to inherit all these core features of a database, 
it must do what they are supposed to do, no argument there. Since Acero is a 
streaming execution engine, how far we are reaching for those goals is not yet 
clear to me. But at the end of the day, if we are benchmarking our performance 
against other systems, it would be best to support such features as optimally 
as possible.
{quote}
It's fair enough if anything that doesn't perfectly fit a streaming paradigm is 
a non-goal for Acero, but then you can't support much of core Substrait as it's 
currently defined, and a generic Substrait optimizer would certainly make a 
mess. At that point I would question what we're (ab)using Substrait for; you 
can't have your cake and eat it too. If we just want to use Substrait as a 
simple means of serializing Acero plans and only have a flawed appearance of 
compatibility with systems that don't explicitly support Acero's dialect, we 
could have saved ourselves a whole lot of trouble (protobuf linking issues, 
anyone?) by just rolling our own format from the start. For some level of 
compatibility we could then have rolled out a converter from Substrait to 
Acero's serialization format outside of Arrow, if only to quarantine protobuf 
from the rest of libarrow.

Substrait core is intended to be a more or less minimal subset of what should 
be expected from an execution engine. If we can't or don't want to meet those 
expectations because we want more flexibility to optimize, we should propose to 
change Substrait, and if those changes are rejected by the community as being 
too Arrow-specific, IMO Substrait is just not for us. We should at least not be 
treating it as a first-class citizen for connecting Acero to Ibis and other 
query APIs in that case.


> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
>                 Key: ARROW-17183
>                 URL: https://issues.apache.org/jira/browse/ARROW-17183
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> In Substrait integrations with Acero, one required capability is fetching 
> records from both sorted and unsorted data.
> A Fetch operation selects `K` records starting at an offset, for instance 
> picking 10 records after skipping the first 5. This can be defined as a 
> Slice operation, and the records can easily be extracted in a sink node. 
> A Sort and Fetch operation applies when a Fetch must be executed on sorted 
> data. The main issue is that a sort node cannot be followed by a fetch node, 
> because all existing node definitions that support sorting are sink nodes. 
> Since no node can follow a sink, this functionality has to take place in a 
> single node. 
> One way to do this, although it is not a perfect solution, is to define a 
> sink node in which the records are sorted and then a set of items is fetched. 
> Another dilemma is the case where sort is followed by a fetch. In that case, 
> there has to be a flag that sets the order of the operations. 
> The objective of this ticket is to discuss a viable, efficient solution and 
> to include new nodes or a method to execute such logic.
>  
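
To make the Fetch and Sort-and-Fetch semantics described in the issue concrete, 
here is a minimal sketch in plain C++ on a vector of integers. It uses only the 
standard library and is not Acero's API; the function names Fetch, 
SortThenFetch and FetchThenSort are hypothetical and exist only for 
illustration. It assumes ascending order and shows why the order of the two 
operations matters.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fetch: skip `offset` records, then keep at most `count` records.
// (Hypothetical helper, not part of Arrow or Acero.)
std::vector<int> Fetch(const std::vector<int>& rows, std::size_t offset,
                       std::size_t count) {
  if (offset >= rows.size()) return {};
  const std::size_t n = std::min(count, rows.size() - offset);
  const auto first = rows.begin() + static_cast<std::ptrdiff_t>(offset);
  return std::vector<int>(first, first + static_cast<std::ptrdiff_t>(n));
}

// Sort, then fetch: only the first offset + count records have to be in
// order, so a partial sort is enough before slicing.
std::vector<int> SortThenFetch(std::vector<int> rows, std::size_t offset,
                               std::size_t count) {
  if (offset >= rows.size()) return {};
  const std::size_t needed = offset + std::min(count, rows.size() - offset);
  std::partial_sort(rows.begin(),
                    rows.begin() + static_cast<std::ptrdiff_t>(needed),
                    rows.end());
  rows.resize(needed);
  return Fetch(rows, offset, count);
}

// Fetch, then sort: slice the unsorted input first and order only the slice.
std::vector<int> FetchThenSort(const std::vector<int>& rows,
                               std::size_t offset, std::size_t count) {
  std::vector<int> out = Fetch(rows, offset, count);
  std::sort(out.begin(), out.end());
  return out;
}
```

For the input {9, 3, 7, 1, 8, 2, 6} with offset 2 and count 3, SortThenFetch 
yields {3, 6, 7} while FetchThenSort yields {1, 7, 8}, so a combined node would 
need some way to say which order is intended.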



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
