[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...

zsxwing Wed, 02 Sep 2015 20:07:20 -0700

Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/7417#issuecomment-137310991
  
    @Sephiroth-Lin there are two changes in your patch: using 
`BroadcastNestedLoopJoin` for the small table, and putting the small table on 
the left side of `RDD.cartesian`. I guess you only tested the first change. 
Could you run some performance tests for the other change?
    
    According to my understanding of `RDD.cartesian`, putting the small table 
on the left side of `RDD.cartesian` won't reduce IO. The number of records that 
need to be scanned is always: `#left_partitions * #right_partitions * 
#records_in_one_left_partition * #records_in_one_right_partition` = 
`#left_records * #right_records`. One benefit I can imagine is reducing the 
number of file open/close operations. 
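
    To make the counting argument concrete, here is a minimal pure-Python model (not Spark code). It assumes each task handles one (left partition, right partition) pair, iterates the left partition once, and re-reads the right partition for every left record. The names `simulate_cartesian` and `make_parts` are hypothetical helpers invented for this sketch.

```python
def simulate_cartesian(left_parts, right_parts):
    """Model a nested-loop cartesian product over partitioned inputs.

    Returns (pairs_emitted, right_partition_opens).
    """
    pairs = 0
    right_opens = 0
    for lp in left_parts:          # one modeled task per (left, right) partition pair
        for rp in right_parts:
            for _x in lp:          # outer loop over left records
                right_opens += 1   # assumption: right partition is re-read per left record
                pairs += len(rp)   # every right record is paired with _x
    return pairs, right_opens


def make_parts(n_records, n_parts):
    """Split n_records evenly into n_parts partitions (n_parts must divide n_records)."""
    per_part = n_records // n_parts
    return [list(range(per_part)) for _ in range(n_parts)]


small = make_parts(100, 2)       # small table: 100 records in 2 partitions
big = make_parts(10_000, 4)      # large table: 10,000 records in 4 partitions

small_left = simulate_cartesian(small, big)
big_left = simulate_cartesian(big, small)
print(small_left, big_left)
```

    Under these assumptions both orderings emit the same number of pairs (`#left_records * #right_records`), while the count of right-partition re-reads is `#left_records * #right_partitions`, so it shrinks when the small table is on the left. Whether that matters in practice is exactly what a benchmark would need to show.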
    
    However, putting the small table on the right side of `RDD.cartesian` could 
take advantage of the OS cache, since it may be small enough to stay entirely 
in the OS cache.
    
    Therefore I'm really curious about the performance improvement of the 
`RDD.cartesian` change.




