Want to avoid groupByKey as it's running forever

2015-06-30 Thread ๏̯͡๏
I have an RDD of type (String, Iterable[(com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord, com.ebay.ep.poc.spark.reporting.process.model.DataRecord)]). Here the String is the key and the value is the list of tuples for that key. I got the above RDD after doing a groupByKey. I later want to compute
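The preview cuts off before the downstream computation, but when that computation is an associative aggregate (a count, a sum, or similar), the usual fix is to push it into the shuffle with reduceByKey or aggregateByKey rather than materialising full per-key Iterables with groupByKey. A minimal sketch, assuming a simple count per key and using empty stand-ins for the eBay record types named in the thread:

    import org.apache.spark.rdd.RDD

    // Stand-ins for the record types mentioned in the thread.
    case class DetailInputRecord()
    case class DataRecord()

    // Instead of pairs.groupByKey(), which ships every (DetailInputRecord, DataRecord)
    // tuple for a key to a single task, an associative aggregate (here: a count per key)
    // is combined map-side and only the partial results are shuffled.
    def countPerKey(pairs: RDD[(String, (DetailInputRecord, DataRecord))]): RDD[(String, Long)] =
      pairs.mapValues(_ => 1L).reduceByKey(_ + _)

reduceByKey combines values within each partition before the shuffle, which is exactly what groupByKey cannot do; that is usually where the "running forever" time goes.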

Re: Want to avoid groupByKey as it's running forever

2015-06-30 Thread Daniel Siegmann
If the number of items is very large, have you considered using probabilistic counting? The HyperLogLogPlus class (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java) from stream-lib (https://github.com/addthis/stream-lib)
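A minimal sketch of how that suggestion could look on an RDD, assuming the goal is an approximate distinct count of string values per key and that a precision of 12 is acceptable (both are assumptions, not details from the thread):

    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
    import org.apache.spark.rdd.RDD

    // One HyperLogLogPlus sketch per key: values are offered into a per-partition
    // sketch and sketches for the same key are merged across partitions, so no
    // per-key collection of raw values is ever built.
    def approxDistinctPerKey(events: RDD[(String, String)]): RDD[(String, Long)] =
      events
        .aggregateByKey(new HyperLogLogPlus(12))(
          (hll, value) => { hll.offer(value); hll },  // fold one value into the sketch
          (a, b) => { a.addAll(b); a }                // merge two sketches
        )
        .mapValues(_.cardinality())

This assumes the sketch serializes cleanly as the zero value; if it does not, the same shape works with combineByKey and a createCombiner that builds the sketch from the first value.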

Re: Want to avoid groupByKey as it's running forever

2015-06-30 Thread ๏̯͡๏
I modified it to: detailInputsToGroup.map { case (detailInput, dataRecord) => val key: StringBuilder = new StringBuilder; dimensions.foreach { dimension => key ++= {
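The preview breaks off mid-expression, but the shape appears to be: build one composite key string per record from a list of dimension names, then key the RDD by it. A minimal sketch of that pattern, with a hypothetical extract callback standing in for however the real records expose a dimension's value, and "|" as an illustrative delimiter:

    import org.apache.spark.rdd.RDD

    // Build a composite key per record from the configured dimensions, then key
    // the RDD by it. The extract callback and the "|" delimiter are assumptions,
    // not details taken from the thread.
    def withCompositeKey[A, B](
        detailInputsToGroup: RDD[(A, B)],
        dimensions: Seq[String],
        extract: (A, String) => String): RDD[(String, (A, B))] =
      detailInputsToGroup.map { case (detailInput, dataRecord) =>
        val key = dimensions.map(d => extract(detailInput, d)).mkString("|")
        (key, (detailInput, dataRecord))
      }

Keyed this way, the aggregates from the earlier sketches (reduceByKey, or aggregateByKey with a cardinality sketch) can be applied directly, avoiding the groupByKey step entirely.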