wesm commented on pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#issuecomment-887924786


   Some updated performance numbers (gcc 9.3, locally on x86):
   
   ```
   -------------------------------------------------------------------------------------
   Benchmark                           Time             CPU   Iterations UserCounters...
   -------------------------------------------------------------------------------------
   BM_ExecBatchIterator/256     11314787 ns     11313272 ns           62 items_per_second=88.3918/s
   BM_ExecBatchIterator/512      5670423 ns      5669598 ns          123 items_per_second=176.379/s
   BM_ExecBatchIterator/1024     2903937 ns      2903272 ns          242 items_per_second=344.439/s
   BM_ExecBatchIterator/2048     1461982 ns      1461711 ns          481 items_per_second=684.13/s
   BM_ExecBatchIterator/4096      739382 ns       739235 ns          951 items_per_second=1.35275k/s
   BM_ExecBatchIterator/8192      370264 ns       370207 ns         1892 items_per_second=2.70119k/s
   BM_ExecBatchIterator/16384     186622 ns       186573 ns         3755 items_per_second=5.35983k/s
   BM_ExecBatchIterator/32768      93581 ns        93567 ns         7437 items_per_second=10.6876k/s
   ```
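   
   For context on what is being measured, here is a minimal, self-contained stand-in for such a benchmark, written against Google Benchmark. This is a sketch under assumptions, not the actual Arrow benchmark (the real one exercises `ExecBatch` slicing, which also constructs sliced `ArrayData`); this version just walks 32 raw arrays in chunks of `state.range(0)` elements and touches one data pointer per field per chunk, so its absolute numbers will not match the table above:
   
   ```
   #include <benchmark/benchmark.h>
   
   #include <cstdint>
   #include <vector>
   
   // Stand-in: slice a "batch" of 32 primitive fields, each 1 << 20
   // elements long, into chunks of state.range(0) elements and touch one
   // data pointer per field per chunk.
   static void BM_SliceBatchStandIn(benchmark::State& state) {
     constexpr int64_t kNumFields = 32;
     constexpr int64_t kLength = 1 << 20;
     std::vector<std::vector<double>> fields(
         kNumFields, std::vector<double>(kLength, 1.0));
     const int64_t chunk_size = state.range(0);
     for (auto _ : state) {
       for (int64_t offset = 0; offset < kLength; offset += chunk_size) {
         for (const auto& field : fields) {
           const double* data = field.data() + offset;
           benchmark::DoNotOptimize(data);
         }
       }
     }
     // One "item" per full pass over the 1M-element batch, to mirror the
     // items_per_second counter in the table.
     state.SetItemsProcessed(state.iterations());
   }
   BENCHMARK(BM_SliceBatchStandIn)->RangeMultiplier(2)->Range(256, 32768);
   BENCHMARK_MAIN();
   ```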
   
   The way to read the table is that breaking an `ExecBatch` with 32 primitive array fields into smaller ExecBatches (and then accessing a data pointer in each batch) has an overhead of approximately (see the arithmetic sketch after this list):
   
   * 2800 nanoseconds per batch
   * 88.6 nanoseconds per batch per field
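   
   Those two figures are my back-of-envelope arithmetic from the table, not direct benchmark output: each iteration slices a full 1 << 20 element batch, so the rows work out to roughly 2760-2890 ns per batch depending on slice size, hence ~2800 ns. A sketch of the math:
   
   ```
   #include <cstdio>
   
   int main() {
     // BM_ExecBatchIterator/256: 11314787 ns per pass over 1 << 20 elements
     const double total_ns = 11314787.0;
     const double num_batches = (1 << 20) / 256.0;        // 4096 slices/pass
     const double ns_per_batch = total_ns / num_batches;  // ~2762 ns
     const double ns_per_field = ns_per_batch / 32;       // ~86 ns
     // Projected cost of slicing 1M elements into 1024-element batches
     const double projected_us = ((1 << 20) / 1024) * 2800 / 1000.0;  // ~2867
     std::printf("per batch: %.0f ns\n", ns_per_batch);
     std::printf("per batch per field: %.1f ns\n", ns_per_field);
     std::printf("1M elements in 1024-element batches: ~%.0f us\n", projected_us);
     return 0;
   }
   ```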
   
   So if you wanted to break a batch with 1M elements into batches of size 1024 for finer-grained parallel processing, you would pay roughly 2900 microseconds to do so. On this same machine, I have:
   
   ```
   In [2]: arr = np.random.randn(1 << 20)
   
   In [3]: timeit arr * 2
   395 µs ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   ```
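   
   Putting the two measurements side by side (again my arithmetic): the projected slicing overhead alone is on the order of 7x the cost of the multiply kernel it would be wrapped around:
   
   ```
   #include <cstdio>
   
   int main() {
     const double slicing_us = 2900.0;   // projected ExecBatch slicing cost
     const double multiply_us = 395.0;   // numpy `arr * 2` on the same machine
     // Bookkeeping would cost ~7.3x the actual computation
     std::printf("overhead / kernel = %.1fx\n", slicing_us / multiply_us);
     return 0;
   }
   ```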
   
   This seems problematic if we wish to enable array expression evaluation at smaller batch sizes to keep more data in CPU caches. I'll bring this up on the mailing list to see what people think.

