kamalbanga edited a comment on issue #25449: [PYSPARK] Simpler countByValue using collections' Counter
URL: https://github.com/apache/spark/pull/25449#issuecomment-521977070
 
 
   I benchmarked it, and the existing implementation is faster 🤦‍♂️
   ```python
   from pyspark import SparkContext, SparkConf
   from random import choices
   from collections import defaultdict, Counter
   from contextlib import contextmanager
   from operator import add
   import time
   
   MAX_NUM = int(1e9)
   RDD_SIZE = int(1e7)
   NUM_PARTITIONS = 4
   
   @contextmanager
   def timethis(snippet):
       start = time.time()
       yield
       print(f'Duration of {snippet}: {(time.time() - start):.1f} seconds')
   
   def countval1(iterator):
       # Proposed change: count an entire partition with collections.Counter
       yield Counter(iterator)
   
   def countval2(iterator):
       # Existing approach: accumulate counts in a plain defaultdict loop
       counts = defaultdict(int)
       for k in iterator:
           counts[k] += 1
       yield counts
   
   def mergeMaps(m1, m2):
       # Merge two per-partition count maps, mutating m2 in place
       for k, v in m1.items():
           m2[k] += v
       return m2
   
   
   random_integers = choices(population=range(MAX_NUM), k=RDD_SIZE)
   
   sc = SparkContext(conf=SparkConf().setAppName('Benchmark'))
   random_rdd = sc.parallelize(random_integers, NUM_PARTITIONS)
   
   with timethis('Spark Counter'):
       # Counter supports '+', so per-partition Counters can be summed with operator.add
       agg1 = random_rdd.mapPartitions(countval1).reduce(add)

   with timethis('Spark defaultdict'):
       agg2 = random_rdd.mapPartitions(countval2).reduce(mergeMaps)
   ```
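
   One possible factor here (my speculation, not something I've verified in the Spark code path) is the merge step rather than the counting itself: `reduce(add)` goes through `Counter.__add__`, which allocates a new Counter and drops non-positive counts on every merge, whereas `mergeMaps` mutates a single dict in place. A rough local sketch (no Spark; `PARTITION_SIZE` and `make_data` are just stand-ins for one partition's worth of data) to isolate the two costs:
   ```python
   from collections import Counter, defaultdict
   from random import choices
   import time

   MAX_NUM = int(1e9)
   PARTITION_SIZE = int(1e7) // 4  # roughly what one of the four partitions sees

   def make_data():
       # Stand-in for the records a single partition would iterate over
       return choices(population=range(MAX_NUM), k=PARTITION_SIZE)

   data = make_data()

   # Per-partition counting cost: Counter vs. an explicit defaultdict loop
   start = time.time()
   c = Counter(data)
   print(f'Counter(iterator): {time.time() - start:.1f}s')

   start = time.time()
   d = defaultdict(int)
   for k in data:
       d[k] += 1
   print(f'defaultdict loop:  {time.time() - start:.1f}s')

   # Merge cost: Counter.__add__ builds a new Counter, Counter.update mutates in place
   other = Counter(make_data())
   start = time.time()
   merged = c + other    # fresh Counter per merge, filters non-positive counts
   print(f'Counter.__add__:   {time.time() - start:.1f}s')

   start = time.time()
   c.update(other)       # in-place merge, closer to what mergeMaps does
   print(f'Counter.update():  {time.time() - start:.1f}s')
   ```
   If the merge turns out to be the expensive part, keeping `mergeMaps` as the reducer (or using an in-place `Counter.update` merge) while only swapping the per-partition step might be worth timing as well.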
