[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

a-roberts Fri, 25 Nov 2016 15:52:23 -0800

Github user a-roberts commented on the issue:

    https://github.com/apache/spark/pull/15736
  
    I've run a lot of performance tests and gathered .hcd files so I can 
investigate this next week. So far, either the first commit is the best for 
performance, or this benchmark with my current configuration can't tell us 
whether our changes really make a difference.
    
    Sharing some raw data. Each row gives the benchmark name, date, time, data 
size in bytes (the same for each run), elapsed time in seconds, and throughput 
in bytes per second.
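
As a rough illustration of that row format, here is a minimal Python sketch that splits one row into named fields and cross-checks the throughput column against size divided by elapsed time. The `Run` type and `parse_row` helper are hypothetical names made up for this example; they are not part of Spark or the benchmark tooling.

```python
from typing import NamedTuple


class Run(NamedTuple):
    # Hypothetical record type for one benchmark row (illustration only).
    benchmark: str
    timestamp: str        # date and time, kept as plain text
    size_bytes: int
    elapsed_s: float
    throughput_bps: int


def parse_row(row: str) -> Run:
    """Split one whitespace-separated result row into named fields."""
    name, date, time, size, elapsed, throughput = row.split()
    return Run(name, f"{date} {time}", int(size), float(elapsed), int(throughput))


# First row of the first data set, written on a single line.
run = parse_row("ScalaSparkPagerank 2016-11-25 18:49:23 259928115 49.577 5242917")

# Cross-check: the reported throughput should be close to size / elapsed time.
derived = run.size_bytes / run.elapsed_s
print(run.throughput_bps, round(derived))
```

Since the data size is identical in every run, throughput carries the same information as elapsed time, so comparing elapsed times alone is enough.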
    
    **With the above suggestions for Partitioned*Buffer**
    ```
    ScalaSparkPagerank 2016-11-25 18:49:23 259928115  49.577  5242917
    ScalaSparkPagerank 2016-11-25 18:56:55 259928115  49.946  5204182
    ScalaSparkPagerank 2016-11-25 19:00:04 259928115  46.510  5588650
    ScalaSparkPagerank 2016-11-25 19:02:23 259928115  49.018  5302707
    ScalaSparkPagerank 2016-11-25 19:05:25 259928115  49.270  5275585
    ```
    
    **Vanilla, no changes at all**
    ```
    ScalaSparkPagerank 2016-11-25 19:08:45 259928115  48.068  5407508
    ScalaSparkPagerank 2016-11-25 19:11:20 259928115  47.712  5447856
    ScalaSparkPagerank 2016-11-25 19:13:50 259928115  44.517  5838850
    ScalaSparkPagerank 2016-11-25 19:16:07 259928115  49.942  5204599
    ScalaSparkPagerank 2016-11-25 19:19:08 259928115  48.521  5357023
    ```
    
    **Original commit**
    ```
    ScalaSparkPagerank 2016-11-25 19:47:59 259928115  45.486  5714464
    ScalaSparkPagerank 2016-11-25 19:50:48 259928115  48.507  5358569
    ScalaSparkPagerank 2016-11-25 19:53:09 259928115  47.063  5522982
    ScalaSparkPagerank 2016-11-25 19:56:58 259928115  46.154  5631757
    ScalaSparkPagerank 2016-11-25 20:00:01 259928115  48.935  5311701
    ```
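
To make the comparison between the three data sets easier to eyeball, here is a small Python sketch that computes the mean and sample standard deviation of the elapsed times listed above. The `RUNS` dictionary and `summarise` helper are hypothetical names for this example, and five runs per configuration is too few for a rigorous significance test, so treat this as a sanity check only.

```python
from statistics import mean, stdev

# Elapsed times in seconds, copied from the three result blocks above.
RUNS = {
    "with suggestions": [49.577, 49.946, 46.510, 49.018, 49.270],
    "vanilla":          [48.068, 47.712, 44.517, 49.942, 48.521],
    "original commit":  [45.486, 48.507, 47.063, 46.154, 48.935],
}


def summarise(runs):
    """Return {configuration: (mean, sample standard deviation)} in seconds."""
    return {name: (mean(times), stdev(times)) for name, times in runs.items()}


summary = summarise(RUNS)
for name, (avg, spread) in summary.items():
    print(f"{name:18s} mean={avg:.2f}s stdev={spread:.2f}s")
```

On these numbers the means come out at roughly 48.9 s, 47.8 s and 47.2 s, while the spread within each configuration is roughly 1.4 s to 2.0 s. The gaps between configurations are therefore about the same size as the run-to-run variation, which fits the observation that this setup may not be able to separate the changes.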
    
    In Healthcenter I do see that these methods are still great candidates for 
optimisation as they are all very commonly used.
    
    Open to more suggestions: I have exclusive access to lots of hardware, can 
easily churn out more custom builds, and have lots of profiling software we can 
use. I'll be committing code for the SizeEstimator soon, as that's a good 
candidate for optimisation here as well.



