[
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570867#comment-17570867
]
Jeroen van Straten edited comment on ARROW-17183 at 7/25/22 12:40 PM:
----------------------------------------------------------------------
{quote}IMHO, what would be better is to write the Acero query and convert it to
a Substrait plan, and then optimize this plan using a third-party optimizer.
Maybe there could be something like substrait-optimizer in the future (I really
don't know). And use this optimized plan to create the Acero plan again.
{quote}
I'm also assuming that this will exist at some point. What I'm saying is that
it's unlikely to optimize for Acero-specific things; it will optimize Substrait
core relations and functions using the properties associated with those by the
spec, such as order maintenance. If an optimal Substrait plan then degrades in
performance just by converting it back to Acero's format due to architectural
differences between Acero and Substrait, you're still going to need to do some
basic optimizations afterward.
{quote}Yes, if Acero expects to inherit all these core features of the database,
it must do what they are supposed to do; no argument there. Since Acero is a
streaming execution engine, how far we are reaching for those goals is not yet
clear to me. But at the end of the day, if we are benchmarking our performance
against other systems, it would be best to support such features in as
optimized a form as possible.
{quote}
It's fair enough if anything that doesn't perfectly fit a streaming paradigm is
a non-goal for Acero, but then you can't support much of core Substrait as it's
currently defined, and a generic Substrait optimizer would certainly make a
mess. At that point I would question what we're (ab)using Substrait for; you
can't have your cake and eat it too. If we just want to use Substrait as a
simple means of serializing Acero plans and only have a flawed appearance of
compatibility with systems that don't explicitly support Acero's dialect, we
could have saved ourselves a whole lot of trouble (protobuf linking issues,
anyone?) by just rolling our own format from the start. For some level of
compatibility we could then have rolled out a converter from Substrait to
Acero's serialization format outside of Arrow, if only to quarantine protobuf
from the rest of libarrow.
Substrait core is intended to be a more or less minimal subset of what should
be expected from an execution engine. If we can't or don't want to meet those
expectations because we want more flexibility to optimize, we should propose to
change Substrait, and if those changes are rejected by the community as being
too Arrow-specific, IMO Substrait is just not for us. We should at least not be
treating it as a first-class citizen for connecting Acero to Ibis and other
query APIs in that case.
> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> In Substrait integrations with Acero, one required capability is fetching
> records from both sorted and unsorted data.
> A Fetch operation is defined as selecting `K` records starting at an offset,
> for instance picking 10 records after skipping the first 5. We can define
> this as a Slice operation, and the records can easily be extracted in a
> sink node.
> A Sort and Fetch operation applies when we need to execute a Fetch operation
> on sorted data. The main issue is that we cannot have a sort node followed by
> a fetch node. The reason is that all existing node definitions supporting
> sort are based on sink nodes. Since a sink node cannot be followed by another
> node, this functionality has to take place in a single node.
> This is not a perfect solution for fetch and sort, but one way to do it is
> to define a sink node where the records are sorted and then a set of items
> is fetched.
> Another dilemma is what if sort is followed by a fetch. In that case, there
> has to be a flag to select the order of the operations.
> The objective of this ticket is to discuss a viable, efficient solution and
> to include new nodes or a method to execute such logic.
>
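To make the Fetch and Sort-and-Fetch semantics described in the ticket concrete, here is a minimal sketch on a plain `std::vector<int>`. It does not use Acero or Substrait; the helper names `Fetch` and `SortThenFetch` are hypothetical, and it assumes an ascending sort and that an out-of-range offset or count is clamped to the input size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Acero API): return up to `count` records after
// skipping the first `offset` records, preserving the input order.
std::vector<int> Fetch(const std::vector<int>& rows, std::size_t offset,
                       std::size_t count) {
  // Clamp the requested window to the available records.
  const std::size_t first = std::min(offset, rows.size());
  const std::size_t last = first + std::min(count, rows.size() - first);
  assert(first <= last && last <= rows.size());
  return std::vector<int>(rows.begin() + static_cast<std::ptrdiff_t>(first),
                          rows.begin() + static_cast<std::ptrdiff_t>(last));
}

// Hypothetical helper (not an Acero API): sort ascending, then fetch.
// Only the first `offset + count` positions have to end up in sorted order,
// so std::partial_sort is sufficient; a full sort is not required.
std::vector<int> SortThenFetch(std::vector<int> rows, std::size_t offset,
                               std::size_t count) {
  const std::size_t first = std::min(offset, rows.size());
  const std::size_t last = first + std::min(count, rows.size() - first);
  std::partial_sort(rows.begin(),
                    rows.begin() + static_cast<std::ptrdiff_t>(last),
                    rows.end());
  return std::vector<int>(rows.begin() + static_cast<std::ptrdiff_t>(first),
                          rows.begin() + static_cast<std::ptrdiff_t>(last));
}
```

For the input `{9, 3, 7, 1, 8, 2, 6, 0, 5, 4}`, `Fetch(rows, 5, 3)` yields `{2, 6, 0}`, while `SortThenFetch(rows, 5, 3)` yields `{5, 6, 7}`. Fetching first and sorting afterward would instead give `{0, 2, 6}`, which is why the order of the two operations has to be made explicit.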