liupc commented on a change in pull request #21950: [SPARK-24914][SQL] Add
configuration to avoid OOM during broadcast join (and other negative side
effects of incorrect table sizing)
URL: https://github.com/apache/spark/pull/21950#discussion_r376190584
##########
File path:
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
##########
@@ -1054,11 +1056,18 @@ private[hive] object HiveClientImpl {
// When table is external, `totalSize` is always zero, which will influence join strategy.
// So when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
// return None.
+ // If a table has a deserialization factor, the table owner expects the in-memory
+ // representation of the table to be larger than the table's totalSize value. In that case,
+ // multiply totalSize by the deserialization factor and use that number instead.
+ // If the user has set spark.sql.statistics.ignoreRawDataSize to true (because of HIVE-20079,
+ // for example), don't use rawDataSize.
// In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
// zero after INSERT command. So they are used here only if they are larger than zero.
- if (totalSize.isDefined && totalSize.get > 0L) {
- Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
- } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
+ val adjustedSize = DataSourceUtils.calcDataSize(properties, totalSize.getOrElse(BigInt(0)))
+ val sqlConf = SQLConf.get
+ if (adjustedSize > 0L) {
+ Some(CatalogStatistics(sizeInBytes = adjustedSize, rowCount = rowCount.filter(_ > 0)))
+ } else if (rawDataSize.isDefined && rawDataSize.get > 0 && !sqlConf.ignoreRawDataSize) {
Review comment:
We already have `spark.sql.sources.fileCompressionFactor`, which is used in
`HadoopFsRelation`, so we should skip the `HiveTableStats` rule for datasource
tables, as described in [#27055](https://github.com/apache/spark/pull/27055).
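
For reference, the fallback order described in the diff comments above can be sketched roughly as follows. This is only an illustrative, self-contained approximation and not the PR's code: `Stats`, `SizeFallbackSketch`, `estimate`, and the `factor` parameter are hypothetical names, and the real `DataSourceUtils.calcDataSize` may compute the adjusted size differently.

```scala
// Hypothetical stand-in for CatalogStatistics; not Spark's class.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

object SizeFallbackSketch {
  // `factor` stands in for the table's deserialization factor and
  // `ignoreRawDataSize` for spark.sql.statistics.ignoreRawDataSize.
  // Both are plain parameters here so the sketch runs without Spark.
  def estimate(
      totalSize: Option[BigInt],
      rawDataSize: Option[BigInt],
      rowCount: Option[BigInt],
      factor: Int,
      ignoreRawDataSize: Boolean): Option[Stats] = {
    // Assumption: the adjusted size is totalSize multiplied by the factor.
    val adjusted = totalSize.getOrElse(BigInt(0)) * BigInt(factor)
    // Row counts of zero are treated as "unknown", as in the diff.
    val rows = rowCount.filter(_ > 0)
    if (adjusted > 0) {
      Some(Stats(adjusted, rows))
    } else if (rawDataSize.exists(_ > 0) && !ignoreRawDataSize) {
      // Fall back to rawDataSize only when the user has not opted out.
      Some(Stats(rawDataSize.get, rows))
    } else {
      None
    }
  }
}
```

With these assumptions, a table whose `totalSize` is zero and whose `rawDataSize` is positive gets statistics only when `ignoreRawDataSize` is false; otherwise the result is `None`.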