[GitHub] [spark] wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation

GitBox Thu, 15 Aug 2019 04:35:51 -0700

wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
URL: https://github.com/apache/spark/pull/24715#issuecomment-521611355
 
 
   I ran some benchmarks.
   
   Prepare data:
   ```scala
   
   // Non-partitioned tables written as 10000 and 300000 files
   spark.range(100000000).repartition(10000).write.saveAsTable("test_non_partition_10000")
   spark.range(100000000).repartition(300000).write.saveAsTable("test_non_partition_300000")
   // Tables partitioned by c2 into 5000 and 10000 partitions
   spark.range(100000000).selectExpr("id", "id % 5000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_5000")
   spark.range(100000000).selectExpr("id", "id % 10000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_10000")
   ```
   Add these lines to [LogicalRelation.computeStats](https://github.com/apache/spark/blob/950d407f2b22f1ae088c55cba3d0081c3c1ecff9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L44):
   ```scala
   // Time the size reported by the relation itself
   val time1 = System.currentTimeMillis()
   val relationSize = relation.sizeInBytes
   val time2 = System.currentTimeMillis()
   // Time the size computed by falling back to HDFS
   val fallBackToHdfsSize = CommandUtils.getSizeInBytesFallBackToHdfs(relation.sqlContext.sparkSession, catalogTable.get)
   val time3 = System.currentTimeMillis()
   // scalastyle:off
   println(s"Get size from relation: $relationSize, time: ${time2 - time1}")
   println(s"Get size fall back to HDFS: $fallBackToHdfsSize, time: ${time3 - time2}")
   // scalastyle:on
   ```
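
   The two measurements above follow the same pattern, so they could also be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical helper, not part of Spark, and the summed range is a stand-in for `relation.sizeInBytes` or `CommandUtils.getSizeInBytesFallBackToHdfs(...)`.

   ```scala
   // Hypothetical helper, not part of Spark: evaluates `body` once and returns
   // its result together with the elapsed wall-clock time in milliseconds.
   def timed[A](body: => A): (A, Long) = {
     val start = System.nanoTime()
     val result = body
     val elapsedMs = (System.nanoTime() - start) / 1000000L
     (result, elapsedMs)
   }

   // Stand-in workload for illustration only; in the benchmark the body would be
   // one of the two size lookups being compared.
   val (size, ms) = timed { (1L to 1000L).sum }
   println(s"Get size: $size, time: $ms")
   ```

   A single call per query only gives a rough number, so repeating the query, as done for some tables below, helps to see the effect of caching.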
   
   Non-partitioned table benchmark result:
   ```
   scala> spark.sql("explain cost select * from test_non_partition_10000 limit 1").show
   Get size from relation: 576588171, time: 22
   Get size fall back to HDFS: 576588171, time: 41
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   
   
   scala> spark.sql("explain cost select * from test_non_partition_10000 limit 1").show
   Get size from relation: 576588171, time: 3
   Get size fall back to HDFS: 576588171, time: 28
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   
   
   scala> spark.sql("explain cost select * from test_non_partition_300000 limit 1").show
   Get size from relation: 706507984, time: 135
   Get size fall back to HDFS: 706507984, time: 2038
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   
   
   scala> spark.sql("explain cost select * from test_non_partition_300000 limit 1").show
   Get size from relation: 706507984, time: 168
   Get size fall back to HDFS: 706507984, time: 3629
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   ```
   
   Partitioned table benchmark result:
   ```
   scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
   Get size from relation: 9223372036854775807, time: 0
   Get size fall back to HDFS: 1018560794, time: 46
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   
   
   scala> spark.sql("explain cost select * from test_partition_10000 limit 1").show
   Get size from relation: 9223372036854775807, time: 0
   Get size fall back to HDFS: 1036799332, time: 43
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   ```
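
   Note that the size reported by the relation here, `9223372036854775807`, is exactly `Long.MaxValue`, so it is a placeholder rather than a measured size. My understanding is that it comes from the default of `spark.sql.defaultSizeInBytes`; that part is an assumption and is not shown by the output above. The equality itself can be checked directly:

   ```scala
   // Value printed as "Get size from relation" for both partitioned tables.
   val reported = BigInt("9223372036854775807")

   // It equals the largest value a Long can hold, i.e. it is not a real size.
   val isPlaceholder = reported == BigInt(Long.MaxValue)
   println(s"reported size is Long.MaxValue: $isPlaceholder")
   ```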
   
   Partitioned table benchmark result with `spark.sql.hive.manageFilesourcePartitions=false` (set via `--conf`):
   ```
   scala> spark.sql("set spark.sql.hive.manageFilesourcePartitions").show
   +--------------------+-----+
   |                 key|value|
   +--------------------+-----+
   |spark.sql.hive.ma...|false|
   +--------------------+-----+
   
   
   scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
   Get size from relation: 1018560794, time: 3
   Get size fall back to HDFS: 1018560794, time: 45
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   
   
   scala> spark.sql("explain cost select * from test_partition_10000 limit 1").show
   Get size from relation: 1036799332, time: 865
   Get size fall back to HDFS: 1036799332, time: 69
   +--------------------+
   |                plan|
   +--------------------+
   |== Optimized Logi...|
   +--------------------+
   ```
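
   To compare the runs more easily, the timings printed above can be turned into ratios. The sketch below only copies the numbers from the output in this comment; the `Run` case class is a hypothetical name for illustration. The partitioned runs with the default setting are left out because the relation time there is 0 and the reported size is not a real measurement.

   ```scala
   // Timings (ms) copied from the benchmark output in this comment.
   case class Run(table: String, relationMs: Long, hdfsMs: Long) {
     // How many times slower the HDFS fallback was than relation.sizeInBytes.
     def slowdown: Double = hdfsMs.toDouble / relationMs
   }

   val runs = Seq(
     Run("test_non_partition_10000", 22, 41),
     Run("test_non_partition_10000", 3, 28),
     Run("test_non_partition_300000", 135, 2038),
     Run("test_non_partition_300000", 168, 3629),
     // Runs with spark.sql.hive.manageFilesourcePartitions=false
     Run("test_partition_5000", 3, 45),
     Run("test_partition_10000", 865, 69)
   )

   runs.foreach { r =>
     println(f"${r.table}%-26s relation=${r.relationMs}%4d ms  hdfs=${r.hdfsMs}%4d ms  ratio=${r.slowdown}%.2f")
   }
   ```

   With these numbers the fallback is roughly 15 to 22 times slower on the 300000-file table, while on `test_partition_10000` with `manageFilesourcePartitions=false` it is faster (69 ms against 865 ms). Each figure comes from a single run, so the ratios are indicative only.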


