[jira] [Commented] (ARROW-13795) [C++] Add async version of the ORC Dataset scanner

Weston Pace (Jira) Tue, 31 Aug 2021 13:05:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407630#comment-17407630
 ]


Weston Pace commented on ARROW-13795:
-------------------------------------

In that case a thread-task-per-batch-read approach is probably simpler. If you 
use the async generators, you can use the readahead generator to cap the 
maximum number of concurrent reads outstanding. That keeps the CPU threads 
from blocking and allows concurrent reads (admittedly, concurrent reads are 
only useful if you're on S3 and have a small number of files).
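
To illustrate the throttling idea only, here is a minimal sketch in standard 
C++ that does not use Arrow's async generator API at all. The class name 
BoundedReadahead, its parameters, and the read_batch callback are hypothetical 
names made up for this example. It starts one task per batch read, keeps at 
most max_readahead reads in flight, and returns the results in order. Unlike 
Arrow's generators, which hand back futures, the consumer here blocks while 
waiting for the oldest read.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <optional>
#include <utility>

// Hypothetical helper, NOT part of Arrow: runs one task per batch read and
// keeps at most `max_readahead` of those reads outstanding at any time.
template <typename T>
class BoundedReadahead {
 public:
  BoundedReadahead(std::function<T(std::size_t)> read_batch,
                   std::size_t num_batches, std::size_t max_readahead)
      : read_batch_(std::move(read_batch)),
        num_batches_(num_batches),
        max_readahead_(max_readahead) {
    assert(max_readahead_ > 0 && "need at least one read in flight");
  }

  // Returns the next batch in order, or std::nullopt once all are consumed.
  std::optional<T> Next() {
    Fill();
    if (in_flight_.empty()) return std::nullopt;
    std::future<T> oldest = std::move(in_flight_.front());
    in_flight_.pop_front();
    return oldest.get();  // waits for the oldest outstanding read only
  }

  // Number of reads currently outstanding (never exceeds max_readahead).
  std::size_t in_flight() const { return in_flight_.size(); }

 private:
  // Launch new reads until the readahead limit or the last batch is reached.
  void Fill() {
    while (next_index_ < num_batches_ && in_flight_.size() < max_readahead_) {
      const std::size_t index = next_index_++;
      in_flight_.push_back(
          std::async(std::launch::async, read_batch_, index));
    }
  }

  std::function<T(std::size_t)> read_batch_;
  std::size_t num_batches_;
  std::size_t max_readahead_;
  std::size_t next_index_ = 0;
  std::deque<std::future<T>> in_flight_;
};
```

Calling Next() in a loop until it returns std::nullopt drains all batches. A 
larger max_readahead allows more overlapping reads, which per the comment 
above mostly pays off on high-latency storage such as S3.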

> [C++] Add async version of the ORC Dataset scanner
> --------------------------------------------------
>
>                 Key: ARROW-13795
>                 URL: https://issues.apache.org/jira/browse/ARROW-13795
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, orc
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support 
> for ORC file format in the Datasets API, but for now only implemented the 
> sync {{OrcFileFormat::ScanFile}}, while we should rather implement 
> {{OrcFileFormat::ScanBatchesAsync}} instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
