Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21950#discussion_r218608537

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -1051,11 +1052,27 @@ private[hive] object HiveClientImpl {
         // When table is external, `totalSize` is always zero, which will influence join strategy.
         // So when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
         // return None.
    +    // If a table has a deserialization factor, the table owner expects the in-memory
    +    // representation of the table to be larger than the table's totalSize value. In that case,
    +    // multiply totalSize by the deserialization factor and use that number instead.
    +    // If the user has set spark.sql.statistics.ignoreRawDataSize to true (because of HIVE-20079,
    +    // for example), don't use rawDataSize.
         // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
         // zero after INSERT command. So they are used here only if they are larger than zero.
    -    if (totalSize.isDefined && totalSize.get > 0L) {
    -      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
    -    } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
    +    val factor = try {
    +      properties.get("deserFactor").getOrElse("1.0").toDouble
    --- End diff --

    I need to eliminate this duplication: there's a similar lookup and calculation in PruneFileSourcePartitionsSuite. Also, I should check whether a Long is acceptable as an intermediate value for holding file sizes (probably, since a Long can represent about 8 exabytes).
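    A minimal sketch of what a shared helper might look like, assuming `properties` is a Scala Map[String, String] (as the `.get(...).getOrElse(...)` call in the diff suggests). The object and method names below are placeholders, not part of the actual patch; only the "deserFactor" key and the 1.0 default come from the diff:

        // Hypothetical shared helper for the deserialization-factor lookup,
        // so the same logic isn't repeated in the test suite.
        private[hive] object DeserFactorUtils {
          // Table property key, taken from the diff above.
          val DeserFactorKey = "deserFactor"

          // Parse the deserialization factor from table properties, falling
          // back to 1.0 when the property is absent or not a valid double.
          def deserFactor(properties: Map[String, String]): Double = {
            try {
              properties.getOrElse(DeserFactorKey, "1.0").toDouble
            } catch {
              case _: NumberFormatException => 1.0
            }
          }

          // Scale a size in bytes by the factor. Going through BigDecimal
          // avoids committing to a Long intermediate, which sidesteps the
          // overflow question for very large sizes or factors. BigInt matches
          // the type of CatalogStatistics.sizeInBytes.
          def adjustedSizeInBytes(totalSize: BigInt, factor: Double): BigInt = {
            (BigDecimal(totalSize) * BigDecimal(factor)).toBigInt
          }
        }

    With a helper like this, both HiveClientImpl and PruneFileSourcePartitionsSuite could call the same parsing and scaling code instead of each doing its own lookup.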