[GitHub] [spark] attilapiros commented on a change in pull request #26016: [SPARK-24914][SQL] New statistic to improve data size estimate for columnar storage formats

GitBox Wed, 20 Nov 2019 11:43:46 -0800

attilapiros commented on a change in pull request #26016: [SPARK-24914][SQL] 
New statistic to improve data size estimate for columnar storage formats
URL: https://github.com/apache/spark/pull/26016#discussion_r348708931


 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
 ##########
 @@ -1186,10 +1186,17 @@ private[hive] object HiveClientImpl {
     // return None.
     // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
     // zero after INSERT command. So they are used here only if they are larger than zero.
+    val deserFactor = properties.get(STATISTICS_DESER_FACTOR).map(_.toInt)
     if (totalSize.isDefined && totalSize.get > 0L) {
-      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
+      Some(CatalogStatistics(
+        sizeInBytes = totalSize.get,
+        deserFactor = deserFactor,
+        rowCount = rowCount.filter(_ > 0)))
     } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
-      Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
+      Some(CatalogStatistics(
+        sizeInBytes = rawDataSize.get,
 
 Review comment:
   In this case (when only `rawDataSize` is defined) I will set `deserFactor`
to `None` to avoid the extra scaling, since `rawDataSize` is already the
"approximate size of data in memory".

   The Hive 1.2 value you are referring to is probably a Hive bug.
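
   To make the proposal concrete, here is a minimal sketch of the branch logic, not the actual patch. `StatsSketch` is a simplified, hypothetical stand-in for `CatalogStatistics` (only the three fields visible in the diff), and `HiveStatsSketch.fromHiveProps` is an illustrative helper name that does not exist in Spark. It assumes `totalSize` is the size the factor is meant to scale.

```scala
// Hypothetical, simplified stand-in for CatalogStatistics; not the real Spark class.
final case class StatsSketch(
    sizeInBytes: BigInt,
    deserFactor: Option[Int] = None,
    rowCount: Option[BigInt] = None)

object HiveStatsSketch {
  // Illustrative helper that mirrors the if / else-if structure in the diff.
  def fromHiveProps(
      totalSize: Option[BigInt],
      rawDataSize: Option[BigInt],
      rowCount: Option[BigInt],
      deserFactor: Option[Int]): Option[StatsSketch] = {
    // As in the diff, a row count of zero is treated as "unknown".
    val positiveRows = rowCount.filter(_ > 0)
    if (totalSize.exists(_ > 0)) {
      // Assumption: totalSize is the size the factor should scale, so keep the factor.
      Some(StatsSketch(totalSize.get, deserFactor, positiveRows))
    } else if (rawDataSize.exists(_ > 0)) {
      // rawDataSize already approximates the in-memory size, so drop the factor
      // to avoid scaling twice.
      Some(StatsSketch(rawDataSize.get, None, positiveRows))
    } else {
      None
    }
  }
}
```

   With this shape, a table that only reports `rawDataSize` never carries a `deserFactor`, so downstream size estimation has nothing extra to multiply by.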
