Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10335#issuecomment-165879022
Hi, @hvanhovell ,
Thank you for your comments! Regarding the benchmarking, I do not have a
good way to measure it. So far, `collect()` is not a good choice when the
workload scale is huge; I guess it causes a large amount of data movement.
The performance numbers are misleading when you compare the RDD Range API and
the DataFrame Range API.
When we use the LogicalRDD-based solution, the default join type will not
choose broadcast, which is a potential issue.
I just tried your suggested method. With `count()`, the LogicalRDD-based
range is about 2 times slower than `sqlContext.range`. I had to increase the
workload scale to `1000000000`. When the scale is small, it is hard to compare
performance, since the result can be affected by many factors.
```
scala> val startTime = System.currentTimeMillis; sqlContext.logicalRDD_Range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime) / 1000.0
startTime: Long = 1450466566418
endTime: Long = 1450466581232
elapsed: Double = 14.814
scala> val startTime = System.currentTimeMillis; sqlContext.logicalRDD_Range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime) / 1000.0
startTime: Long = 1450466583781
endTime: Long = 1450466597751
elapsed: Double = 13.97
scala> val startTime = System.currentTimeMillis; sqlContext.range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime) / 1000.0
startTime: Long = 1450466600825
endTime: Long = 1450466608397
elapsed: Double = 7.572
scala> val startTime = System.currentTimeMillis; sqlContext.range(0, 1000000000, 1, 15).count(); val endTime = System.currentTimeMillis; val elapsed = (endTime - startTime) / 1000.0
startTime: Long = 1450466611679
endTime: Long = 1450466619421
elapsed: Double = 7.742
```
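Since single timed runs like the ones above can be affected by many factors, a
small helper that warms up first and then reports the median of several runs
may give steadier numbers. This is only a rough sketch, not what was run above:
`Bench`, `timeRuns`, and `median` are hypothetical names and are not part of
Spark.

```scala
// Hypothetical helper (not part of Spark): repeat a measurement to reduce noise.
object Bench {
  /** Runs `body` `warmup` times untimed, then `runs` times timed.
    * Returns the elapsed wall-clock time of each timed run, in seconds. */
  def timeRuns[A](warmup: Int, runs: Int)(body: => A): Seq[Double] = {
    (1 to warmup).foreach(_ => body)
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e9
    }
  }

  /** Median of a non-empty sequence of samples. */
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one sample")
    val s = xs.sorted
    val n = s.length
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }
}
```

In the shell, this could be used as
`Bench.median(Bench.timeRuns(warmup = 1, runs = 5) { sqlContext.range(0, 1000000000, 1, 15).count() })`,
and likewise for the LogicalRDD-based variant.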