felipecrv commented on issue #40301:
URL: https://github.com/apache/arrow/issues/40301#issuecomment-2045967729

   > I'm honestly not sure what this has to do with work stealing.
   
   It's the inspiration for this, but if it makes things harder to understand, just ignore it.
   
   > What is the concern? The total number of allocations? The max allocated 
working set during the workload?
   
   The practical concern is that `jemalloc` and `mimalloc` allocate significant 
amounts of thread-local data:
   
   > For performance reasons, [jemalloc](https://engineering.fb.com/2011/01/03/core-infra/scalable-memory-allocation-using-jemalloc/) and [mimalloc](https://github.com/microsoft/mimalloc/issues/351) maintain allocations on a per-memory segment level to reduce contention between threads.
   
   ----
   
   > Yes, I agree this could just be fixed with a heuristic, especially as 
parallelizing will not be performant on very small data.
   
   And that is exactly what I'm doing, but instead of looking at the number of rows, I look at how long the column conversions take on average, while they are running, never blocking [1] or sleeping.
   
   It's an adaptive solution: you pass a duration, and if the average task takes longer than that duration, a few more threads are allowed to start.
   
   ```cpp
   // If the average wall-clock time per task is greater than the
   // small_task_duration_secs, grow the number of workers geometrically.
   if (static_cast<double>(elapsed_time.count() * steady_clock::period::num) >
       (small_task_duration_secs * steady_clock::period::den) * num_tasks_completed) {
     num_workers = std::min(
         ((num_workers * GrowthRatio::num - 1) / GrowthRatio::den) + 1, max_workers);
   }
   ```
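
   To make the geometric growth concrete, here is a minimal, self-contained sketch of how the worker count evolves under that formula. The `GrowthRatio` value below (`std::ratio<3, 2>`) is only an assumption for illustration; the snippet above doesn't show which ratio is actually used:

   ```cpp
   #include <algorithm>
   #include <cstdio>
   #include <ratio>

   // Assumed growth ratio, for illustration only; not taken from the patch.
   using GrowthRatio = std::ratio<3, 2>;

   int main() {
     const int max_workers = 16;
     int num_workers = 1;
     while (num_workers < max_workers) {
       // Same ceiling-style update as in the snippet above: multiply by the
       // ratio, round up, and never exceed max_workers.
       num_workers = std::min(
           static_cast<int>((num_workers * GrowthRatio::num - 1) / GrowthRatio::den + 1),
           max_workers);
       std::printf("%d ", num_workers);  // prints: 2 3 5 8 12 16
     }
     std::printf("\n");
     return 0;
   }
   ```

   Starting from a single worker, the count ramps up quickly, but only while the measured average keeps exceeding the threshold, so trivial workloads never spin up extra threads.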
   
   And instead of running into these issues for every kind of parallelization we add in the future, this function can be reused for any problem where task sizes range from trivial to very expensive.
   
   This is what the fix looks like at the call site:
   
   ```diff
   -    return OptionalParallelFor(options_.use_threads, num_columns_, WriteColumn);
   +    if (options_.use_threads) {
   +      const double kSmallTaskDurationSecs = 0.008;  // 8ms
   +      return ::arrow::internal::ParallelForWithBackOff(num_columns_, WriteColumn,
   +                                                       kSmallTaskDurationSecs);
   +    } else {
   +      for (int i = 0; i < num_columns_; ++i) {
   +        RETURN_NOT_OK(WriteColumn(i));
   +      }
   +      return Status::OK();
   +    }
   ```
   
   [1] except for the lock-free reading of the `next_task` atomic
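
   For context on [1], the general shape of lock-free task claiming with an atomic counter looks roughly like the sketch below. The names (`RunTasks`, the worker lambda) and the use of `std::thread` are illustrative, not the exact code from this change:

   ```cpp
   #include <atomic>
   #include <cstdio>
   #include <thread>
   #include <vector>

   void RunTasks(int num_tasks, int num_workers) {
     std::atomic<int> next_task{0};
     auto worker = [&] {
       for (;;) {
         // Claim the next task index without blocking or sleeping; a worker
         // simply exits once the counter runs past the end.
         const int i = next_task.fetch_add(1, std::memory_order_relaxed);
         if (i >= num_tasks) break;
         std::printf("task %d done\n", i);  // stand-in for WriteColumn(i)
       }
     };
     std::vector<std::thread> threads;
     for (int w = 0; w < num_workers; ++w) threads.emplace_back(worker);
     for (auto& t : threads) t.join();
   }

   int main() {
     RunTasks(/*num_tasks=*/8, /*num_workers=*/3);
     return 0;
   }
   ```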

