Scala doesn't support ranges with more than Int.MaxValue elements:
https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/immutable/Range.scala?utf8=✓#L89
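You can reproduce the limit in a plain Scala REPL, without Spark. A minimal
sketch; the exception comes from the same NumericRange check that shows up in
the stack traces quoted below:

// Building the range succeeds; only computing its element count
// trips the NumericRange size check.
val r = 1L to Int.MaxValue.toLong * 2
r.length  // java.lang.IllegalArgumentException: More than Int.MaxValue elements.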
You can create two RDDs and union them (a sketch generalizing this to any
number of chunks appears at the end of this reply):

scala> val rdd = sc.parallelize(1L to Int.MaxValue.toLong).union(sc.parallelize(1L to Int.MaxValue.toLong))
rdd: org.apache.spark.rdd.RDD[Long] = UnionRDD[10] at union at <console>:24

scala> rdd.count
[Stage 0:> (0 + 4) / 8]

Also, instead of creating the range on the driver, you can create your RDD in
parallel:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val numberOfParts = 100
val numberOfElementsInEachPart = Int.MaxValue.toDouble / 100
val rdd = sc.parallelize(1 to numberOfParts).flatMap(partNum => {
  val begin = ((partNum - 1) * numberOfElementsInEachPart).toLong
  val end = (partNum * numberOfElementsInEachPart).toLong
  begin to end
})

// Exiting paste mode, now interpreting.

numberOfParts: Int = 100
numberOfElementsInEachPart: Double = 2.147483647E7
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[15] at flatMap at <console>:31

scala> rdd.count
res10: Long = 2147483747

Note that this count is 100 higher than Int.MaxValue (2147483647): "begin to
end" is inclusive on both sides, so each part's last element is repeated as
the next part's first element (99 duplicates), and part 1 starts at 0 rather
than 1. A corrected sketch follows below.
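Here is the same idea with non-overlapping chunks, as a sketch (assuming a
spark-shell session with sc in scope). Integer arithmetic splits the range
exactly, so the count comes out to precisely Int.MaxValue:

val numberOfParts = 100
val total = Int.MaxValue.toLong

// Part k covers (k-1)*total/numberOfParts + 1 .. k*total/numberOfParts,
// so the per-part ranges partition 1..total with no gaps or duplicates.
val rdd = sc.parallelize(1 to numberOfParts).flatMap { partNum =>
  val begin = (partNum - 1) * total / numberOfParts + 1
  val end = partNum * total / numberOfParts
  begin to end
}

rdd.count  // expected: 2147483647, exactly Int.MaxValue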
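And the union approach generalized to arbitrary sizes, again only as a
sketch; longRangeRDD is a hypothetical helper (not part of Spark's API) that
keeps every sub-range at or below the Int.MaxValue element limit:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: covers 1L to n by unioning sub-ranges that each
// hold at most Int.MaxValue elements.
def longRangeRDD(sc: SparkContext, n: Long): RDD[Long] = {
  val chunk = Int.MaxValue.toLong
  val rdds = (1L to n by chunk).map { start =>
    sc.parallelize(start to math.min(start + chunk - 1, n))
  }
  sc.union(rdds)
}

val rdd = longRangeRDD(sc, 2L * Int.MaxValue)
rdd.count  // expected: 4294967294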
On Tue, Aug 8, 2017 at 1:26 PM, makoto <tokomakoma...@gmail.com> wrote:

> Hello,
> I'd like to count more than Int.MaxValue. But I encountered the following
> error.
>
> scala> val rdd = sc.parallelize(1L to Int.MaxValue*2.toLong)
> rdd: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[28] at
> parallelize at <console>:24
>
> scala> rdd.count
> java.lang.IllegalArgumentException: More than Int.MaxValue elements.
>   at scala.collection.immutable.NumericRange$.check$1(NumericRange.scala:304)
>   at scala.collection.immutable.NumericRange$.count(NumericRange.scala:314)
>   at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:52)
>   at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:51)
>   at scala.collection.immutable.NumericRange.length(NumericRange.scala:54)
>   at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:145)
>   at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
>   ... 48 elided
>
> How can I avoid the error?
> A similar problem is as follows:
>
> scala> rdd.reduce((a,b) => (a + b))
> java.lang.IllegalArgumentException: More than Int.MaxValue elements.
>   at scala.collection.immutable.NumericRange$.check$1(NumericRange.scala:304)
>   at scala.collection.immutable.NumericRange$.count(NumericRange.scala:314)
>   at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:52)
>   at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:51)
>   at scala.collection.immutable.NumericRange.length(NumericRange.scala:54)
>   at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:145)
>   at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2119)
>   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
>   ... 48 elided