Github user tbertelsen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6454#discussion_r31255908
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -580,7 +580,18 @@ abstract class RDD[T: ClassTag](
        * elements (a, b) where a is in `this` and b is in `other`.
        */
       def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    -    new CartesianRDD(sc, this, other)
    +    val numPartitions = {
    +      val numExecutors = System.getProperty("spark.executor.instances")
    +
    +      if (numExecutors != null) {
    +        numExecutors.toInt
    +      } else {
    +        sc.defaultMinPartitions
    +      }
    +    }
    +    val coalesced = coalesce(numPartitions, shuffle = true)
    +    val coalescedOther = other.coalesce(numPartitions, shuffle = true)
    +    new CartesianRDD(sc, coalesced, coalescedOther)
    --- End diff ---
    
    What type of cluster have you tested this on? I expect this will perform
poorly if your cluster has few machines with many cores each, for example two
machines with 32 cores each: `spark.executor.instances` would give only 2
partitions per side, i.e. 2 * 2 = 4 cartesian partitions for 64 cores. We
would ideally like the product of the numbers of partitions of the two RDDs
to equal the default parallelism (that is, the total number of cores). Since
we do a shuffle anyway, we should be able to calculate the number of
partitions pretty exactly.
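
    To make this concrete, here is a rough sketch (hypothetical, not code
from this PR) of how the two partition counts could be chosen so that their
product lands near `sc.defaultParallelism`, assuming both sides are shuffled
anyway:

        import org.apache.spark.SparkContext

        // Hypothetical helper: split the target parallelism into two factors
        // whose product is at least, and close to, sc.defaultParallelism.
        def cartesianPartitionCounts(sc: SparkContext): (Int, Int) = {
          val target = math.max(1, sc.defaultParallelism)
          val left = math.max(1, math.round(math.sqrt(target.toDouble)).toInt)
          val right = math.ceil(target.toDouble / left).toInt
          (left, right)
        }

    Coalescing `this` to `left` partitions and `other` to `right` partitions
(both with `shuffle = true`) would then give `left * right` cartesian
partitions, i.e. roughly one output partition per core.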
    
    What will happen if we coalesce with shuffle set to false? For example,
if an RDD has 16 partitions on 4 nodes and we coalesce to six partitions?
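
    For what it is worth, the non-shuffle behaviour is easy to observe on a
toy RDD (a hypothetical experiment, not from this PR). With `shuffle =
false`, `coalesce` merges existing partitions through a narrow dependency,
so no data is repartitioned across the cluster:

        // 16 input partitions, narrowed to 6 without a shuffle: each output
        // partition is a local union of 2-3 of the original partitions.
        val rdd = sc.parallelize(1 to 1600, numSlices = 16)
        val narrowed = rdd.coalesce(6, shuffle = false)
        println(narrowed.partitions.length)                        // 6
        println(narrowed.glom().map(_.length).collect().mkString(", "))

    So the partition sizes can end up uneven, and the work is not rebalanced
across nodes the way a shuffle would rebalance it.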
    
     

