Weston Pace created ARROW-11800:
-----------------------------------

             Summary: [C++] Add unordered scan
                 Key: ARROW-11800
                 URL: https://issues.apache.org/jira/browse/ARROW-11800
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


Currently Scan generates an ordered sequence of batches, which is not always
ideal.  For example, consider reading four Parquet files in parallel from S3.
There is no good way to determine which read will finish first.  If file 2
finishes before file 1, we could start parsing the contents of file 2
immediately, but we currently do not.

Scan could then provide an option controlling whether ordering is preserved.
Cases that do not care about ordering (e.g. counting rows) could take
advantage of this to reduce memory pressure.
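
To make the memory argument concrete, here is a rough sketch rather than a
proposed API.  ScanOptions::preserve_order and PeakBuffered are hypothetical
names invented for illustration; nothing here is existing Arrow code.  Given
the order in which file reads complete, it counts how many finished reads must
sit in memory waiting for an earlier file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical option; the issue proposes the choice but names no field.
struct ScanOptions {
  bool preserve_order = true;
};

// Given the order in which file reads complete (by file index, each index
// appearing once), returns the largest number of completed reads left
// buffered because an earlier file has not finished yet.
std::size_t PeakBuffered(const std::vector<std::size_t>& completion_order,
                         const ScanOptions& options) {
  // Unordered: every read is handed downstream as soon as it completes.
  if (!options.preserve_order) return 0;

  std::set<std::size_t> waiting;  // completed reads not yet emitted
  std::size_t next = 0;           // index of the next file allowed to emit
  std::size_t peak = 0;
  for (std::size_t file : completion_order) {
    assert(file >= next && "file index was already emitted");
    waiting.insert(file);
    // Emit every buffered read that is now next in sequence.
    while (!waiting.empty() && *waiting.begin() == next) {
      waiting.erase(waiting.begin());
      ++next;
    }
    peak = std::max(peak, waiting.size());
  }
  return peak;
}
```

With four files completing in reverse order, the ordered case holds three
completed reads until file 0 arrives, while the unordered case holds none.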

Note: This would be an optimization even for cases that do care about ordering.
We could still parse / project / etc. out of order and simply reorder at the
end.  The only difference between unordered and ordered would then be the memory
pressure applied.
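
The "reorder at the end" step could look something like the following sketch.
Batch and Resequencer are hypothetical stand-ins, not Arrow types, and the
sketch assumes each batch carries its position in the ordered scan.  Batches
are accepted in whatever order they finish parsing and released only once
every earlier batch has been released.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed record batch; not an Arrow type.
struct Batch {
  std::size_t index;    // position in the ordered scan (assumed to be known)
  std::string payload;  // placeholder for the parsed / projected data
};

// Buffers batches that arrive early and releases them in index order.
class Resequencer {
 public:
  // Accepts a batch in any order and returns the batches that can now be
  // emitted in order (possibly none).
  std::vector<Batch> Push(Batch batch) {
    assert(batch.index >= next_ && "batch index was already emitted");
    const std::size_t index = batch.index;
    pending_.emplace(index, std::move(batch));

    std::vector<Batch> ready;
    for (auto it = pending_.find(next_); it != pending_.end();
         it = pending_.find(next_)) {
      ready.push_back(std::move(it->second));
      pending_.erase(it);
      ++next_;
    }
    return ready;
  }

  // Number of batches currently held back; this is the extra memory
  // pressure that the ordered case pays.
  std::size_t buffered() const { return pending_.size(); }

 private:
  std::size_t next_ = 0;                  // next index allowed to emit
  std::map<std::size_t, Batch> pending_;  // early arrivals, keyed by index
};
```

An unordered scan would skip this stage entirely and pass each batch
downstream as soon as it is ready.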



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
