[GitHub] spark pull request: [Spark-12374][SPARK-12150][SQL] Adding logical...

gatorsmile Fri, 18 Dec 2015 11:38:10 -0800

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10335#issuecomment-165879022
  
    Hi, @hvanhovell , 
    
    Thank you for your comments! Regarding the benchmarking, I do not have a
    good way to measure the difference. So far, `collect()` is not a good fit
    when the workload is huge, since I guess it causes large-scale data
    movement to the driver. That makes the performance numbers misleading
    when you compare the RDD Range API with the DataFrame Range API.
    
    When we use the LogicalRDD-based solution, the default join type will
    not choose broadcast, which is a potential issue.
    
    I just tried your suggested method. The LogicalRDD-based version is
    about 2 times slower when using `count()`. I had to increase the workload
    scale to `1000000000`. When the scale is small, it is hard to compare
    performance because the result can be affected by many factors.
    
    ```
    scala> val startTime = System.currentTimeMillis; sqlContext.logicalRDD_Range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime)/ 1000.0
    startTime: Long = 1450466566418
    endTime: Long = 1450466581232
    elapsed: Double = 14.814

    scala> val startTime = System.currentTimeMillis; sqlContext.logicalRDD_Range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime)/ 1000.0
    startTime: Long = 1450466583781
    endTime: Long = 1450466597751
    elapsed: Double = 13.97

    scala> val startTime = System.currentTimeMillis; sqlContext.range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime)/ 1000.0
    startTime: Long = 1450466600825
    endTime: Long = 1450466608397
    elapsed: Double = 7.572

    scala> val startTime = System.currentTimeMillis; sqlContext.range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime)/ 1000.0
    startTime: Long = 1450466611679
    endTime: Long = 1450466619421
    elapsed: Double = 7.742
    ```
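
The repeated timings above could also be wrapped in a small helper so that each variant runs the same way, with a warm-up run before the timed ones. This is only a minimal sketch: `TimingSketch`, `timed`, `benchmark`, and `mean` are hypothetical names and not part of Spark or this pull request, and the workload in `main` is a plain-Scala stand-in so the snippet runs without a `sqlContext`.

```scala
object TimingSketch {
  /** Runs `body` once and returns its result with the elapsed time in seconds. */
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedSec = (System.nanoTime() - start) / 1e9
    (result, elapsedSec)
  }

  /** Runs `body` untimed `warmups` times, then returns the elapsed seconds of `runs` timed runs. */
  def benchmark[A](warmups: Int, runs: Int)(body: => A): Seq[Double] = {
    (1 to warmups).foreach(_ => body)
    (1 to runs).map(_ => timed(body)._2)
  }

  /** Arithmetic mean of a non-empty sequence of timings. */
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  def main(args: Array[String]): Unit = {
    // Stand-in workload (assumption). In spark-shell the block would instead be
    // something like: sqlContext.range(0, 1000000000, 1, 15).count()
    val times = benchmark(warmups = 1, runs = 3) { (0L until 1000000L).sum }
    println(f"mean elapsed: ${mean(times)}%.3f s")

    // Ratio of the mean timings reported in the transcript above.
    val logicalRdd = mean(Seq(14.814, 13.97))
    val builtIn = mean(Seq(7.572, 7.742))
    println(f"LogicalRDD-based / built-in range: ${logicalRdd / builtIn}%.2f")
  }
}
```

Averaging several timed runs after a warm-up should reduce the noise from JIT compilation and scheduling that makes small-scale comparisons unreliable, though it is still a coarse wall-clock measurement.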


