Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22758#discussion_r226192210

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -193,6 +193,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
           None)
         val logicalRelation = cached.getOrElse {
           val updatedTable = inferIfNeeded(relation, options, fileFormat)
    +      // Intialize the catalogTable stats if its not defined.An intial value has to be defined
    --- End diff --

Thanks for your valuable feedback. My observations:

1) In the insert flow we always try to update the Hive stats, per the following statement in `InsertIntoHadoopFsRelationCommand`:

```
if (catalogTable.nonEmpty) {
  CommandUtils.updateTableStats(sparkSession, catalogTable.get)
}
```

However, after a CREATE TABLE command, when we run an INSERT command within the same session, the Hive statistics are not updated because of the following validation, whose condition requires the stats to be non-empty:

```
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  if (table.stats.nonEmpty) {
```

But if we re-launch spark-shell and run the INSERT command again, the Hive statistics will be saved; from then on the stats will be taken from the Hive metastore, and the flow will never try to estimate the data size from the files.

2) Currently the system does not consistently estimate the data size from the files when executing an INSERT command: as noted above, if we launch the query from a new context, the system reads the stats from Hive. I think there is a behavior-consistency problem here; also, if we can always get the stats from Hive, do we need to recalculate the stats from the files every time?

>> I think we may need to update the flow so that it always reads the data size from the files and never depends on the Hive stats,
>> or, if we are recording the Hive stats, then it should read the Hive stats every time.
Please let me know whether I am going in the right direction, and let me know if any clarifications are needed.
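To make the same-session behaviour described above concrete, here is a minimal, hypothetical Scala sketch. It is not Spark source: `Stats`, `CatalogTable`, and `updateTableStats` here are simplified stand-ins for Spark's `CatalogStatistics`, `CatalogTable`, and `CommandUtils.updateTableStats`, modeling only the non-empty-stats guard that makes the first INSERT after CREATE TABLE a stats no-op:

```scala
// Simplified stand-ins (NOT Spark source) for the classes discussed above.
case class Stats(sizeInBytes: BigInt)
case class CatalogTable(name: String, stats: Option[Stats])

// Mirrors the guard in CommandUtils.updateTableStats:
// stats are refreshed only when they are already defined.
def updateTableStats(table: CatalogTable, newSize: BigInt): CatalogTable =
  if (table.stats.nonEmpty) table.copy(stats = Some(Stats(newSize)))
  else table // no-op: a freshly created table has no stats yet

// Same session: CREATE TABLE leaves stats empty, so the INSERT's update is skipped.
val created     = CatalogTable("t", stats = None)
val afterInsert = updateTableStats(created, newSize = BigInt(1024))
assert(afterInsert.stats.isEmpty)

// New session: the table is read back with stats defined, so updates now stick.
val reloaded = CatalogTable("t", stats = Some(Stats(BigInt(0))))
val updated  = updateTableStats(reloaded, newSize = BigInt(1024))
assert(updated.stats.contains(Stats(BigInt(1024))))
```

Under this model, either proposed fix removes the asymmetry: always estimating from files ignores `table.stats` entirely, while always recording stats would initialize them at table creation so the guard never skips the update.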