Github user bersprockets commented on a diff in the pull request:
https://github.com/apache/spark/pull/21950#discussion_r218608537
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -1051,11 +1052,27 @@ private[hive] object HiveClientImpl {
     // When table is external, `totalSize` is always zero, which will influence join strategy.
     // So when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
     // return None.
+    // If a table has a deserialization factor, the table owner expects the in-memory
+    // representation of the table to be larger than the table's totalSize value. In that case,
+    // multiply totalSize by the deserialization factor and use that number instead.
+    // If the user has set spark.sql.statistics.ignoreRawDataSize to true (because of HIVE-20079,
+    // for example), don't use rawDataSize.
     // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
     // zero after INSERT command. So they are used here only if they are larger than zero.
-    if (totalSize.isDefined && totalSize.get > 0L) {
-      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
-    } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
+    val factor = try {
+      properties.get("deserFactor").getOrElse("1.0").toDouble
--- End diff --
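To make the flow above concrete, here is a minimal sketch of how the logic described in the comment block could look once the factor is applied. The `ignoreRawDataSize` flag variable, the catch clause, and everything past the quoted line are assumptions for illustration, not the exact code in the PR:

    // Sketch only. `totalSize`, `rawDataSize` and `rowCount` are the Option[BigInt] values
    // already computed earlier in this method; `ignoreRawDataSize` stands in for the value of
    // spark.sql.statistics.ignoreRawDataSize and is an assumed name, not the PR's variable.
    val factor = try {
      properties.get("deserFactor").getOrElse("1.0").toDouble
    } catch {
      case _: NumberFormatException => 1.0
    }
    if (totalSize.isDefined && totalSize.get > 0L) {
      // Scale the on-disk size by the deserialization factor before it reaches the planner.
      val adjustedSize = (BigDecimal(totalSize.get) * BigDecimal(factor)).toBigInt
      Some(CatalogStatistics(sizeInBytes = adjustedSize, rowCount = rowCount.filter(_ > 0)))
    } else if (!ignoreRawDataSize && rawDataSize.isDefined && rawDataSize.get > 0) {
      Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
    } else {
      None
    }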
I need to eliminate the duplication here: a similar lookup and calculation is done in PruneFileSourcePartitionsSuite. Also, I should check whether a Long is acceptable as an intermediate value for holding file sizes (probably it is, since a Long can represent about 8 exabytes).
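One way to remove the duplication could be a small shared helper along these lines (the object name and where it would live are placeholders, not what this PR actually adds); doing the multiplication in BigDecimal would also sidestep the Long-intermediate question entirely:

    // Placeholder helper (name and location are assumptions) that HiveClientImpl and the
    // test suite could both call, so the property lookup and the math live in one place.
    private[hive] object DeserializationFactor {
      val Key = "deserFactor"

      // Returns 1.0 when the property is absent or unparseable.
      def read(properties: Map[String, String]): Double = {
        try {
          properties.get(Key).getOrElse("1.0").toDouble
        } catch {
          case _: NumberFormatException => 1.0
        }
      }

      // Multiplying in BigDecimal avoids any overflow concern, even for sizes near the Long limit.
      def adjust(sizeInBytes: BigInt, properties: Map[String, String]): BigInt = {
        (BigDecimal(sizeInBytes) * BigDecimal(read(properties))).toBigInt
      }
    }

PruneFileSourcePartitionsSuite could then compute its expected size via DeserializationFactor.adjust(...) instead of repeating the lookup and the calculation.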
---