westonpace commented on issue #15264:
URL: https://github.com/apache/arrow/issues/15264#issuecomment-1377333536

   > My requirement is not to use multithreading, but when I set use_threads to 
false, if batch_readahead * batch_size > the number of rows in a row_group 
which to be read, it will read multiple row_groups at the same time in the 
parquetFileFragment. This means that when I turn use_threads off, multithreaded 
reads still occur in my code. My existing code doesn't support multithreading, 
so it reads errors.
   
   Most effort on threading has been to avoid using multiple threads for 
compute.  In 11.0.0 the threading controls for execution plans are improving 
and it should be easier to avoid using additional threads for compute.
   
   However, there has not been much effort in avoiding multiple threads for 
reads.  Is there a particular reason you want this?  I/O routines are blocking, 
so I/O threads typically spend most of their life in a waiting state and use 
few resources.  If you are reading from an HDD then you might not see much of 
a performance hit from serializing reads.  However, just about any other type 
of disk will benefit from at least some parallel I/O.
   
   One option you could try would be to change the I/O thread pool size to 1 
using `arrow::io::SetIOThreadPoolCapacity`.
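   
   A minimal sketch of that option, assuming it runs once at startup before 
any scan begins.  The helper name `LimitIoThreadsToOne` is made up for 
illustration; only `SetIOThreadPoolCapacity` comes from Arrow:
   
   ```cpp
   #include <arrow/io/interfaces.h>
   #include <arrow/status.h>

   // Hypothetical helper (not part of Arrow): shrink the global I/O thread
   // pool to one thread so that reads issued through it run one at a time.
   arrow::Status LimitIoThreadsToOne() {
     return arrow::io::SetIOThreadPoolCapacity(1);
   }
   ```
   
   The capacity is process-wide, so it affects every reader that uses the 
default I/O context.  Reads still run on that one pool thread rather than on 
the calling thread.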
   
   The fix is probably to skip the readahead generator when `batch_readahead` 
is 0:
   
   ```
   diff --git a/cpp/src/arrow/dataset/file_parquet.cc b/cpp/src/arrow/dataset/file_parquet.cc
   index 0d95e1817..10a1ac8ce 100644
   --- a/cpp/src/arrow/dataset/file_parquet.cc
   +++ b/cpp/src/arrow/dataset/file_parquet.cc
   @@ -471,6 +471,9 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
                                  reader, row_groups, column_projection,
                                  ::arrow::internal::GetCpuThreadPool(), rows_to_readahead));
        RecordBatchGenerator sliced = SlicingGenerator(std::move(generator), batch_size);
   +    if (batch_readahead == 0) {
   +      return sliced;
   +    }
        RecordBatchGenerator sliced_readahead =
            MakeSerialReadaheadGenerator(std::move(sliced), batch_readahead);
        return sliced_readahead;
   ```
   
   Are you able to try this patch out?
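   
   To exercise the `batch_readahead == 0` branch, the scanner would need to be 
built with threads off and batch readahead set to zero.  A rough sketch, 
assuming you go through `ScannerBuilder`; the function name `MakeSerialScanner` 
is made up for illustration and not part of Arrow:
   
   ```cpp
   #include <memory>

   #include <arrow/dataset/dataset.h>
   #include <arrow/dataset/scanner.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   // Hypothetical helper: build a scanner with use_threads = false and
   // batch_readahead = 0, the combination the patch above checks for.
   arrow::Result<std::shared_ptr<arrow::dataset::Scanner>> MakeSerialScanner(
       const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
     ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
     ARROW_RETURN_NOT_OK(builder->UseThreads(false));
     ARROW_RETURN_NOT_OK(builder->BatchReadahead(0));
     return builder->Finish();
   }
   ```
   
   Whether this is enough to keep every read on one thread depends on the 
patch behaving as intended, which is what needs testing.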
   
   I will have to get some tests added to get this merged in.

