liupc commented on a change in pull request #21950: [SPARK-24914][SQL] Add 
configuration to avoid OOM during broadcast join (and other negative side 
effects of incorrect table sizing)
URL: https://github.com/apache/spark/pull/21950#discussion_r376190584
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
 ##########
 @@ -1054,11 +1056,18 @@ private[hive] object HiveClientImpl {
     // When table is external, `totalSize` is always zero, which will influence join strategy.
     // So when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
     // return None.
+    // If a table has a deserialization factor, the table owner expects the in-memory
+    // representation of the table to be larger than the table's totalSize value. In that case,
+    // multiply totalSize by the deserialization factor and use that number instead.
+    // If the user has set spark.sql.statistics.ignoreRawDataSize to true (because of HIVE-20079,
+    // for example), don't use rawDataSize.
     // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
     // zero after INSERT command. So they are used here only if they are larger than zero.
-    if (totalSize.isDefined && totalSize.get > 0L) {
-      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
-    } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
+    val adjustedSize = DataSourceUtils.calcDataSize(properties, totalSize.getOrElse(BigInt(0)))
+    val sqlConf = SQLConf.get
+    if (adjustedSize > 0L) {
+      Some(CatalogStatistics(sizeInBytes = adjustedSize, rowCount = rowCount.filter(_ > 0)))
+    } else if (rawDataSize.isDefined && rawDataSize.get > 0 && !sqlConf.ignoreRawDataSize) {
 
 Review comment:
   We already have `spark.sql.sources.fileCompressionFactor`, which is used in 
`HadoopFsRelation`. We should skip the `HiveTableStats` rule for datasource 
tables, as described in [#27055](https://github.com/apache/spark/pull/27055).
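
   For readers following the hunk, the fallback order it introduces can be 
sketched roughly as below. This is a simplified, standalone illustration and not 
the PR's code: `Stats`, `adjustedSize`, and the `deserFactor` property key are 
hypothetical stand-ins, the real behaviour of `DataSourceUtils.calcDataSize` is 
not shown in the hunk, and the body of the `rawDataSize` branch is assumed from 
the surrounding comments ("When `rawDataSize` is also zero, return None").

```scala
// Hypothetical stand-in for CatalogStatistics; not Spark's actual class.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

object SizeSelectionSketch {
  // Assumption: the table-level factor lives in a property named "deserFactor".
  // The real key and parsing used by DataSourceUtils.calcDataSize may differ.
  def adjustedSize(properties: Map[String, String], totalSize: BigInt): BigInt = {
    val factor = properties.get("deserFactor").map(BigInt(_)).getOrElse(BigInt(1))
    totalSize * factor
  }

  def selectStats(
      properties: Map[String, String],
      totalSize: Option[BigInt],
      rawDataSize: Option[BigInt],
      rowCount: Option[BigInt],
      ignoreRawDataSize: Boolean): Option[Stats] = {
    val adjusted = adjustedSize(properties, totalSize.getOrElse(BigInt(0)))
    // Row counts of zero are treated as "unknown", as in the hunk.
    val rows = rowCount.filter(_ > 0)
    if (adjusted > 0) {
      // 1. Prefer totalSize, scaled by the factor, when it is positive.
      Some(Stats(adjusted, rows))
    } else if (ignoreRawDataSize) {
      // 2. The user has opted out of rawDataSize (e.g. because of HIVE-20079).
      None
    } else {
      // 3. Otherwise fall back to a positive rawDataSize, else no statistics.
      rawDataSize.filter(_ > 0).map(size => Stats(size, rows))
    }
  }
}
```

   With a factor of 3 and a `totalSize` of 100, the sketch reports 300 bytes. With 
a zero `totalSize`, it uses `rawDataSize` only when `ignoreRawDataSize` is false.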
