[
https://issues.apache.org/jira/browse/ARROW-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368826#comment-17368826
]
Jayjeet Chakraborty commented on ARROW-13161:
---------------------------------------------
Thanks a lot for the quick response. We are using the
[DisposableScannerAdaptor|https://github.com/apache/arrow/blob/master/cpp/src/jni/dataset/jni_wrapper.cc#L140]
in the JNI bridge for integrating Arrow with Spark. We use the Dataset API
call to read a single file (not a collection of files). We want to map Spark's
iterator onto the iterator that the Dataset API provides via JNI. In the Arrow
layer,
[DisposableScannerAdaptor::Create|https://github.com/apache/arrow/blob/master/cpp/src/jni/dataset/jni_wrapper.cc#L146]
calls the
[ScanBatches|https://github.com/apache/arrow/blob/master/cpp/src/jni/dataset/jni_wrapper.cc#L148]
method from SyncScanner. AFAIK, ScanBatches is supposed to be lazy: it should
return an iterator, and batches should be read only when Next() is called on
that iterator. But currently, since the fragment readahead is > 0, our only
fragment (note that we have a single file per Dataset API call) is read before
we ask for it, which breaks the iterator mapping to Spark. We just want
ScanBatches to return a lazy TaggedRecordBatchIterator. I hope that clarifies
our requirement.
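
To make the difference concrete, here is a minimal toy sketch in C++. It is
not Arrow code: CountingSource, ReadaheadIterator and ReadsBeforeFirstNext are
hypothetical names made up for illustration, and the buffering logic is only
an assumption about how a readahead wrapper might behave. It shows that any
readahead > 0 pulls data from the source before the consumer calls Next(),
while readahead == 0 is purely lazy.

```cpp
#include <cstddef>
#include <deque>

// Toy stand-in for a single-fragment reader (hypothetical, not an Arrow type).
// It counts reads so we can observe *when* batches are pulled.
class CountingSource {
 public:
  explicit CountingSource(int total) : total_(total) {}

  // Writes the next batch id to *out; returns false when exhausted.
  bool Next(int* out) {
    if (reads_ >= total_) return false;
    *out = reads_++;
    return true;
  }

  int reads() const { return reads_; }

 private:
  int total_;
  int reads_ = 0;
};

// Buffers up to `readahead` batches, starting at construction time.
// With readahead == 0 it degenerates to a purely lazy iterator.
class ReadaheadIterator {
 public:
  ReadaheadIterator(CountingSource* source, std::size_t readahead)
      : source_(source), readahead_(readahead) {
    Fill();
  }

  // Hands out the oldest buffered batch, then tops the buffer up again.
  bool Next(int* out) {
    if (buffer_.empty()) return source_->Next(out);
    *out = buffer_.front();
    buffer_.pop_front();
    Fill();
    return true;
  }

 private:
  // Eagerly reads from the source until the buffer holds `readahead_` batches
  // or the source is exhausted.
  void Fill() {
    int batch;
    while (buffer_.size() < readahead_ && source_->Next(&batch)) {
      buffer_.push_back(batch);
    }
  }

  CountingSource* source_;
  std::size_t readahead_;
  std::deque<int> buffer_;
};

// Number of batches read from the source before the consumer calls Next().
int ReadsBeforeFirstNext(int total_batches, std::size_t readahead) {
  CountingSource source(total_batches);
  ReadaheadIterator it(&source, readahead);
  return source.reads();
}
```

In this toy model, ReadsBeforeFirstNext(1, 2) is 1: the single "fragment" is
consumed as soon as the iterator is created. ReadsBeforeFirstNext(1, 0) is 0,
which is the behaviour we need for the Spark iterator mapping.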
> Allow setting FragmentReadahead to 0 in ScannerBuilder
> ------------------------------------------------------
>
> Key: ARROW-13161
> URL: https://issues.apache.org/jira/browse/ARROW-13161
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jayjeet Chakraborty
> Priority: Major
>
> I have an application where I need to set fragment readahead to 0. However, it
> looks like the ScannerBuilder, for some reason, does not allow setting the
> fragment readahead to 0 [1]. It would be very helpful to know why that is, and
> whether a PR lifting that restriction would be accepted, because a docstring
> mentions that users can set fragment readahead to 0 if they want [2].
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L864
> [2] https://github.com/apache/arrow/blob/998a2a1668ea57a49d85fbb38f7f0e7eb94c29db/cpp/src/arrow/dataset/scanner.h#L93