westonpace commented on issue #36765:
URL: https://github.com/apache/arrow/issues/36765#issuecomment-1642531758

   Possibly.  I think there are two concerns users generally have.  Either they 
want "max speed" (use as little memory as possible, but if there is a 
speed/memory tradeoff, prefer speed) or they want "least memory" (if there is a 
speed/memory tradeoff, prefer memory).
   
   I tried to simplify things in #35889 (this is why I was running experiments 
a few weeks ago) and came up with:
   
   parquet::ReaderProperties
   ```cpp
         case ParquetScanStrategy::kLeastMemory:
           properties.enable_buffered_stream();
           properties.set_buffer_size(8 * 1024 * 1024);
           break;
         case ParquetScanStrategy::kMaxSpeed:
           properties.disable_buffered_stream();
           break;
   ```
   
   parquet::ArrowReaderProperties
   ```cpp
         case ParquetScanStrategy::kLeastMemory:
           properties.set_batch_size(acero::ExecPlan::kMaxBatchSize);
           properties.set_pre_buffer(false);
           break;
         case ParquetScanStrategy::kMaxSpeed:
           properties.set_batch_size(64 * 1024 * 1024);
           properties.set_pre_buffer(true);
           properties.set_cache_options(io::CacheOptions::LazyDefaults());
           break;
   ```
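   For anyone following along, the two snippets are cases of a switch over a strategy enum.  A minimal self-contained sketch of that pattern (the `ScanConfig` struct and the `32 * 1024` placeholder for `acero::ExecPlan::kMaxBatchSize` are stand-ins, not the real Arrow types or values):
   ```cpp
   #include <cstdint>
   #include <iostream>

   // Hypothetical stand-in for the settings carried by
   // parquet::ReaderProperties / parquet::ArrowReaderProperties,
   // just to make the switch sketch compilable on its own.
   struct ScanConfig {
     bool buffered_stream = false;
     int64_t buffer_size = 0;
     int64_t batch_size = 0;
     bool pre_buffer = false;
   };

   enum class ParquetScanStrategy { kLeastMemory, kMaxSpeed };

   // Maps a strategy to reader settings.  Note the explicit break
   // statements: without them kLeastMemory would fall through into
   // kMaxSpeed and the settings would be overwritten.
   ScanConfig MakeScanConfig(ParquetScanStrategy strategy) {
     ScanConfig config;
     switch (strategy) {
       case ParquetScanStrategy::kLeastMemory:
         config.buffered_stream = true;
         config.buffer_size = 8 * 1024 * 1024;
         config.batch_size = 32 * 1024;  // placeholder for acero::ExecPlan::kMaxBatchSize
         config.pre_buffer = false;
         break;
       case ParquetScanStrategy::kMaxSpeed:
         config.buffered_stream = false;
         config.batch_size = 64 * 1024 * 1024;
         config.pre_buffer = true;
         break;
     }
     return config;
   }

   int main() {
     ScanConfig least = MakeScanConfig(ParquetScanStrategy::kLeastMemory);
     ScanConfig fast = MakeScanConfig(ParquetScanStrategy::kMaxSpeed);
     std::cout << "least-memory buffer_size=" << least.buffer_size
               << " max-speed pre_buffer=" << fast.pre_buffer << "\n";
     return 0;
   }
   ```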
   
   I'm pretty sure that the `kLeastMemory` options use very low memory (even 
when scanning large files).  I wasn't convinced that `kMaxSpeed` was much 
faster (but I only tested local disks and not S3).
   
   I am very surprised to see a 10x difference due to pre-buffering.  I don't 
think any of our experiments (S3 or not) ever showed a difference that was that 
drastic.

