kamalbanga commented on issue #25449: [PYSPARK] Simpler countByValue using collections' Counter

URL: https://github.com/apache/spark/pull/25449#issuecomment-521977070

I benchmarked it, and the existing defaultdict-based implementation is faster than the Counter-based one 🤦‍♂️

```python
from pyspark import SparkContext, SparkConf
from random import choices
from collections import defaultdict, Counter
from contextlib import contextmanager
from operator import add
import time

MAX_NUM = int(1e9)
RDD_SIZE = int(1e7)
NUM_PARTITIONS = 4

@contextmanager
def timethis(snippet):
    start = time.time()
    yield
    print(f'Duration of {snippet}: {(time.time() - start):.1f} seconds')

# Per-partition counting with Counter (the proposed implementation)
def countval1(iterator):
    yield Counter(iterator)

# Per-partition counting with defaultdict (the existing implementation)
def countval2(iterator):
    counts = defaultdict(int)
    for k in iterator:
        counts[k] += 1
    yield counts

# Merge function used by the existing implementation: mutates m2 in place
def mergeMaps(m1, m2):
    for k, v in m1.items():
        m2[k] += v
    return m2

# Local (non-Spark) baselines for the two counting strategies
def cntr(values):
    return Counter(values)

def dd(values):
    counts = defaultdict(int)
    for k in values:
        counts[k] += 1
    return counts

random_integers = choices(population=range(MAX_NUM), k=RDD_SIZE)

with timethis('direct counter'):
    dagg1 = cntr(random_integers)

with timethis('direct defaultdict'):
    dagg2 = dd(random_integers)

sc = SparkContext(conf=SparkConf().setAppName('Benchmark'))
random_rdd = sc.parallelize(random_integers, NUM_PARTITIONS)

with timethis('Spark Counter'):
    agg1 = random_rdd.mapPartitions(countval1).reduce(add)

with timethis('Spark defaultdict'):
    agg2 = random_rdd.mapPartitions(countval2).reduce(mergeMaps)
```
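For what it's worth, one plausible factor in the gap: `Counter.__add__` (what `reduce(add)` calls) allocates a brand-new Counter on every merge and filters out non-positive counts, while `mergeMaps` mutates one of its arguments in place. A minimal sketch of an in-place Counter merge via `Counter.update`, which adds counts from another mapping without creating a new object (the `merge_counters_inplace` name and the `agg3` usage line are hypothetical, not from the PR):

```python
from collections import Counter

def merge_counters_inplace(c1, c2):
    # Counter.update adds counts from another mapping in place,
    # avoiding the new-Counter allocation and the positive-count
    # filtering that c1 + c2 performs on every reduce step
    c1.update(c2)
    return c1

# Hypothetical usage against the same RDD as the benchmark above:
# agg3 = random_rdd.mapPartitions(countval1).reduce(merge_counters_inplace)
```

That keeps the Counter-based partition counting but should behave more like `mergeMaps` during the reduce; I haven't benchmarked this variant.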