[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

Vibhatha Lakmal Abeykoon (Jira) Mon, 25 Jul 2022 03:10:22 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570818#comment-17570818
 ]


Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--------------------------------------------------

[~jvanstraten] this is a good discussion. 

If Acero is going to be tightly coupled to Substrait, we need to think about 
all these points carefully. I had imagined Acero and Substrait to be 
orthogonal in both implementation and decision making, but if the goal is to 
support Substrait at a much deeper level, these points are important when 
re-thinking the design and filling in any missing features.

Keeping an index column is an interesting idea, and making it optional via 
extensions could be a good way to handle ordering. But I am not sure whether 
that should be a feature of Acero or an attribute of the table itself. 
AFAIK Arrow doesn't support indexing data, so keeping a column solely to 
guarantee the order is something to think about in the long run. Would we 
attach it when we read data from the source and drop it after the final 
operation (at the sink)? It is a viable idea, but where to implement it is an 
open question. How to re-index when rows are dropped could be a separate 
topic in itself. cc [~westonpace] 
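
To make the index column idea concrete, here is a minimal sketch in plain 
C++ (standard library only, not Acero code). `IndexedRow`, `AttachIndex` and 
`RestoreOrderAndDropIndex` are hypothetical names used only for illustration; 
the sketch assumes the index is attached at the source and removed at the sink.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type: the real columns plus an implicit index column.
struct IndexedRow {
  int64_t index;      // position assigned when the row was read at the source
  std::string value;  // stand-in for the actual payload columns
};

// Source side: tag every row with its position in the input.
std::vector<IndexedRow> AttachIndex(const std::vector<std::string>& values) {
  std::vector<IndexedRow> rows;
  rows.reserve(values.size());
  for (size_t i = 0; i < values.size(); ++i) {
    rows.push_back({static_cast<int64_t>(i), values[i]});
  }
  return rows;
}

// Sink side: restore the source order, then drop the index column.
std::vector<std::string> RestoreOrderAndDropIndex(std::vector<IndexedRow> rows) {
  std::sort(rows.begin(), rows.end(),
            [](const IndexedRow& a, const IndexedRow& b) {
              return a.index < b.index;
            });
  // Each source position may appear at most once.
  assert(std::adjacent_find(rows.begin(), rows.end(),
                            [](const IndexedRow& a, const IndexedRow& b) {
                              return a.index == b.index;
                            }) == rows.end());
  std::vector<std::string> out;
  out.reserve(rows.size());
  for (auto& row : rows) out.push_back(std::move(row.value));
  return out;
}
```

In this sketch, rows dropped along the way only leave gaps in the index, so 
the relative order can still be restored without re-indexing. Whether gaps 
are acceptable in a real implementation is a separate question.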

Another thing I am not sure about is query optimization. For now I have 
assumed that Acero will consume an already optimized plan, meaning we receive 
an optimized Substrait plan (though I am not sure this will hold in 
practice). Should a built Acero exec-plan be optimized internally before it 
runs? I guess that would be an important feature, but I am not sure whether 
we have a plan for this kind of implementation.
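
As one example of what such an internal optimization could look like, a sort 
followed by a fetch can be fused into a partial sort that only orders the 
first `offset + count` elements. The sketch below is plain C++ with the 
standard library, not Acero code; `SortThenFetch` and `FusedSortFetch` are 
hypothetical names, and a vector of ints stands in for a table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unoptimized plan: sort everything, then slice [offset, offset + count).
std::vector<int> SortThenFetch(std::vector<int> v, size_t offset, size_t count) {
  std::sort(v.begin(), v.end());
  size_t begin = std::min(offset, v.size());
  size_t end = begin + std::min(count, v.size() - begin);
  return std::vector<int>(v.begin() + begin, v.begin() + end);
}

// Fused plan: only the first `end` elements need to be in sorted order,
// so std::partial_sort can replace the full sort.
std::vector<int> FusedSortFetch(std::vector<int> v, size_t offset, size_t count) {
  size_t begin = std::min(offset, v.size());
  size_t end = begin + std::min(count, v.size() - begin);
  assert(end <= v.size());
  std::partial_sort(v.begin(), v.begin() + end, v.end());
  return std::vector<int>(v.begin() + begin, v.begin() + end);
}
```

Both functions return the same rows; the fused version just avoids ordering 
the tail that the fetch would discard anyway.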

The Fetch node's current goal is to fetch a set of records with or without 
ordering. The current implementation in the created PR is sub-optimal: it 
doesn't do anything special to guarantee the ordering or to optimize the 
fetch.
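
For reference, the unordered fetch behaviour described here can be sketched 
as a running offset/count window over incoming batches. This is plain C++ 
with the standard library and is not the code in the PR; `FetchState` and 
`Consume` are hypothetical names, and a vector of ints stands in for a batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tracks how many rows still have to be skipped and emitted across batches.
class FetchState {
 public:
  FetchState(size_t offset, size_t count) : to_skip_(offset), to_emit_(count) {}

  // Returns the part of `batch` that falls inside the fetch window.
  std::vector<int> Consume(const std::vector<int>& batch) {
    size_t skip = std::min(to_skip_, batch.size());
    to_skip_ -= skip;
    size_t take = std::min(to_emit_, batch.size() - skip);
    to_emit_ -= take;
    assert(skip + take <= batch.size());
    return std::vector<int>(batch.begin() + skip, batch.begin() + skip + take);
  }

  // True once `count` rows have been emitted; upstream could stop here.
  bool Done() const { return to_emit_ == 0; }

 private:
  size_t to_skip_;
  size_t to_emit_;
};
```

Note that the result depends on the order in which batches arrive, which is 
exactly why nothing here guarantees ordering.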

> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
>                 Key: ARROW-17183
>                 URL: https://issues.apache.org/jira/browse/ARROW-17183
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> In Substrait integrations with Acero, one required piece of functionality 
> is the ability to fetch records, sorted or unsorted.
> A Fetch operation is defined as selecting `K` records starting at an offset, 
> for instance picking 10 records after skipping the first 5. We can define 
> this as a Slice operation, and the records can be easily extracted in a 
> sink node. 
> A Sort and Fetch operation applies when we need to execute a Fetch on 
> sorted data. The main issue is that we cannot have a sort node followed by a 
> fetch node, because all existing node definitions that support sorting are 
> sink nodes. Since no node can follow a sink, this functionality has to take 
> place in a single node. 
> This is not a perfect solution for fetch and sort, but one way to do it is 
> to define a sink node in which the records are sorted and then a set of 
> items is fetched. 
> Another dilemma is what if sort is followed by a fetch. In that case, there 
> has to be a flag to select the order of the operations. 
> The objective of this ticket is to discuss a viable, efficient solution and 
> to add new nodes or a method to execute such logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
