Github user a-roberts commented on the issue:

    https://github.com/apache/spark/pull/15736

New data for us, inlined comparator scores here (code provided below so you can check I've not profiled something useless!):

```
ScalaSparkPagerank 2016-12-05 13:44:41 259928115 48.149 5398411 5398411
ScalaSparkPagerank 2016-12-05 13:46:43 259928115 46.897 5542531 5542531
ScalaSparkPagerank 2016-12-05 13:48:46 259928115 49.130 5290619 5290619
ScalaSparkPagerank 2016-12-05 13:50:49 259928115 49.793 5220173 5220173
ScalaSparkPagerank 2016-12-05 13:52:50 259928115 48.061 5408296 5408296
ScalaSparkPagerank 2016-12-05 13:54:52 259928115 46.468 5593701 5593701
ScalaSparkPagerank 2016-12-05 13:56:56 259928115 51.385 5058443 5058443
ScalaSparkPagerank 2016-12-05 13:58:59 259928115 47.857 5431349 5431349
ScalaSparkPagerank 2016-12-05 14:00:59 259928115 46.515 5588049 5588049
ScalaSparkPagerank 2016-12-05 14:03:03 259928115 47.791 5438850 5438850

Avg 48.2046s
```

Remember our "vanilla" average time is 47.752s and our first commit averaged 47.229s, so there's not much of a difference really. I think we're splitting hairs, and I've got another PR I'm seeing good results on that I plan to focus on instead: the SizeEstimator. This is what I've benchmarked, PartitionedAppendOnlyMap first. Let me know if there are any further suggestions; otherwise I propose leaving this one for later, since against the Spark master codebase I'm not noticing anything exciting.

In PartitionedAppendOnlyMap:

```scala
def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
    : Iterator[((Int, K), V)] = {
  val comparator = {
    if (keyComparator.isDefined) {
      val theKeyComp = keyComparator.get
      new Comparator[(Int, K)] {
        // We know we have a non-empty key comparator here: order by partition
        // ID first, then break ties with the key comparator.
        override def compare(a: (Int, K), b: (Int, K)): Int = {
          if (a._1 != b._1) {
            a._1 - b._1
          } else {
            theKeyComp.compare(a._2, b._2)
          }
        }
      }
    } else {
      new Comparator[(Int, K)] {
        // No key comparator supplied, so order by partition ID alone.
        // Partition IDs are non-negative, so the subtraction cannot overflow.
        override def compare(a: (Int, K), b: (Int, K)): Int = {
          a._1 - b._1
        }
      }
    }
  }
  destructiveSortedIterator(comparator)
}
```

In PartitionedPairBuffer:

```scala
/** Iterate through the data in a given order. For this class this is not really destructive. */
override def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
    : Iterator[((Int, K), V)] = {
  val comparator = {
    if (keyComparator.isDefined) {
      val theKeyComp = keyComparator.get
      new Comparator[(Int, K)] {
        // We know we have a non-empty key comparator here: order by partition
        // ID first, then break ties with the key comparator.
        override def compare(a: (Int, K), b: (Int, K)): Int = {
          if (a._1 != b._1) {
            a._1 - b._1
          } else {
            theKeyComp.compare(a._2, b._2)
          }
        }
      }
    } else {
      new Comparator[(Int, K)] {
        // No key comparator supplied, so order by partition ID alone.
        override def compare(a: (Int, K), b: (Int, K)): Int = {
          a._1 - b._1
        }
      }
    }
  }
  new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, curSize, comparator)
  iterator
}
```

WritablePartitionedPairCollection remains unchanged.
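Side note: the comparator construction above is duplicated verbatim in both classes. If the duplication ever becomes a maintenance concern, one option would be to hoist it into a small shared factory. Below is a minimal sketch of that idea; `PartitionComparators` and `partitionKeyComparator` are hypothetical names, not part of this patch, and whether hoisting preserves whatever inlining benefit we measured would need re-benchmarking.

```scala
import java.util.Comparator

// Hypothetical shared factory for the (partition ID, key) comparator used by
// both PartitionedAppendOnlyMap and PartitionedPairBuffer above. Assumes
// partition IDs are non-negative ints, so plain subtraction cannot overflow.
object PartitionComparators {
  def partitionKeyComparator[K](keyComparator: Option[Comparator[K]]): Comparator[(Int, K)] =
    keyComparator match {
      case Some(keyComp) =>
        new Comparator[(Int, K)] {
          // Order by partition ID first; break ties with the key comparator.
          override def compare(a: (Int, K), b: (Int, K)): Int =
            if (a._1 != b._1) a._1 - b._1 else keyComp.compare(a._2, b._2)
        }
      case None =>
        new Comparator[(Int, K)] {
          // No key comparator supplied: order by partition ID alone.
          override def compare(a: (Int, K), b: (Int, K)): Int = a._1 - b._1
        }
    }
}
```

Each call site would then reduce to building the comparator via `PartitionComparators.partitionKeyComparator(keyComparator)` and passing it to `destructiveSortedIterator` or the `Sorter`, respectively.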