Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22698#discussion_r224484333
--- Diff: sql/core/benchmarks/RangeBenchmark-results.txt ---
@@ -0,0 +1,16 @@
+================================================================================================
+range
+================================================================================================
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
+
+range:                                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+full scan                                    12674 / 12840         41.4          24.2       1.0X
+limit after range                                  33 / 37      15900.2           0.1     384.4X
+filter after range                               969 / 985        541.0           1.8      13.1X
+count after range                                  42 / 42      12510.5           0.1     302.4X
+count after limit after range                      32 / 33      16337.0           0.1     394.9X
--- End diff --
A few takeaways:
1. The limit does help.
2. Performance is bad if we interrupt the data processing loop too
often. Full scan is the worst case, because we interrupt the loop for every record.
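
To make the second point concrete, here is a minimal sketch (not Spark's generated code) that counts how often a range-producing loop is interrupted. The function `run_range` and its parameters `check_every` and `limit` are hypothetical names for illustration only. It measures interruption counts, not wall-clock time.

```python
def run_range(num_rows, check_every, limit=None):
    """Sum the values in [0, num_rows), leaving the tight inner loop
    once every `check_every` rows (an "interruption").

    Hypothetical illustration only: `check_every` stands in for how often
    the producer hands control back downstream, and `limit` for a limit
    pushed down to the producer.
    Returns (rows_produced, interruptions, total).
    """
    if check_every <= 0:
        raise ValueError("check_every must be positive")
    target = num_rows if limit is None else min(num_rows, limit)
    produced = 0
    interruptions = 0
    total = 0
    while produced < target:
        batch_end = min(produced + check_every, target)
        # Tight inner loop: no checks until the batch is finished.
        for value in range(produced, batch_end):
            total += value
        produced = batch_end
        interruptions += 1  # control leaves the inner loop here
    return produced, interruptions, total


# Interrupting for every record (analogous to the full-scan case above).
per_record = run_range(1000, check_every=1)
# Interrupting once per 100 records.
batched = run_range(1000, check_every=100)
# With a limit, the loop stops early and is interrupted only once.
limited = run_range(1000, check_every=100, limit=50)
```

Both unlimited runs produce the same rows and the same sum. Only the number of times the inner loop is left differs, which is the cost the comment attributes to the full-scan case.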
---