kamalbanga commented on issue #25449: [PYSPARK] Simpler countByValue using collections' Counter

URL: https://github.com/apache/spark/pull/25449#issuecomment-521977070

I benchmarked it, and the existing defaultdict-based implementation is faster than the Counter-based one 🤦‍♂️

```python
from pyspark import SparkContext, SparkConf
from random import choices
from collections import defaultdict, Counter
from contextlib import contextmanager
from operator import add
import time

MAX_NUM = int(1e9)
RDD_SIZE = int(1e7)
NUM_PARTITIONS = 4

@contextmanager
def timethis(snippet):
    start = time.time()
    yield
    print(f'Duration of {snippet}: {(time.time() - start):.1f} seconds')

# Per-partition counting with Counter (the proposed implementation)
def countval1(iterator):
    yield Counter(iterator)

# Per-partition counting with defaultdict (the existing implementation)
def countval2(iterator):
    counts = defaultdict(int)
    for k in iterator:
        counts[k] += 1
    yield counts

# Merge function used by the existing implementation: mutates m2 in place
def mergeMaps(m1, m2):
    for k, v in m1.items():
        m2[k] += v
    return m2

# Local (non-Spark) baselines for the two counting strategies
def cntr(values):
    return Counter(values)

def dd(values):
    counts = defaultdict(int)
    for k in values:
        counts[k] += 1
    return counts

random_integers = choices(population=range(MAX_NUM), k=RDD_SIZE)

with timethis('direct counter'):
    dagg1 = cntr(random_integers)

with timethis('direct defaultdict'):
    dagg2 = dd(random_integers)

sc = SparkContext(conf=SparkConf().setAppName('Benchmark'))
random_rdd = sc.parallelize(random_integers, NUM_PARTITIONS)

with timethis('Spark Counter'):
    agg1 = random_rdd.mapPartitions(countval1).reduce(add)

with timethis('Spark defaultdict'):
    agg2 = random_rdd.mapPartitions(countval2).reduce(mergeMaps)
```
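For what it's worth, one plausible factor in the gap: `Counter.__add__` (what `reduce(add)` calls) allocates a brand-new Counter on every merge and filters out non-positive counts, while `mergeMaps` mutates one of its arguments in place. A minimal sketch of an in-place Counter merge via `Counter.update`, which adds counts from another mapping without creating a new object (the `merge_counters_inplace` name and the `agg3` usage line are hypothetical, not from the PR):

```python
from collections import Counter

def merge_counters_inplace(c1, c2):
    # Counter.update adds counts from another mapping in place,
    # avoiding the new-Counter allocation and the positive-count
    # filtering that c1 + c2 performs on every reduce step
    c1.update(c2)
    return c1

# Hypothetical usage against the same RDD as the benchmark above:
# agg3 = random_rdd.mapPartitions(countval1).reduce(merge_counters_inplace)
```

That keeps the Counter-based partition counting but should behave more like `mergeMaps` during the reduce; I haven't benchmarked this variant.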