wangyum commented on a change in pull request #24003:
[SPARK-19678][FOLLOW-UP][SQL] Add behavior change test when table statistics
are incorrect
URL: https://github.com/apache/spark/pull/24003#discussion_r263395794
##########
File path:
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
##########
@@ -110,6 +110,41 @@ class StatisticsSuite extends
StatisticsCollectionTestBase with TestHiveSingleto
}
}
+ test("Hive serde table with incorrect statistics") {
+ withTempDir { tempDir =>
+ withTable("t1") {
+ spark.range(5).write.mode(SaveMode.Overwrite).parquet(tempDir.getCanonicalPath)
+ val dataSize = tempDir.listFiles.filter(!_.getName.endsWith(".crc")).map(_.length).sum
+ spark.sql(
+ s"""
+ |CREATE EXTERNAL TABLE t1(id BIGINT)
+ |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
+ |STORED AS
+ | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
+ | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
+ |LOCATION '${tempDir.getCanonicalPath}'
+ |TBLPROPERTIES (
+ |'rawDataSize'='-1', 'numFiles'='0', 'totalSize'='0',
+ |'COLUMN_STATS_ACCURATE'='false', 'numRows'='-1'
+ |)""".stripMargin)
+
+ spark.sql("REFRESH TABLE t1")
+ // Before SPARK-19678, sizeInBytes should be equal to dataSize.
+ // After SPARK-19678, sizeInBytes should be equal to DEFAULT_SIZE_IN_BYTES.
+ val relation1 = spark.table("t1").queryExecution.analyzed.children.head
+ assert(relation1.stats.sizeInBytes === spark.sessionState.conf.defaultSizeInBytes)
Review comment:
Before SPARK-19678, sizeInBytes is dataSize (397) regardless of whether
`ENABLE_FALL_BACK_TO_HDFS_FOR_STATS` is true or false.
The computeStats code path taken when `ENABLE_FALL_BACK_TO_HDFS_FOR_STATS` is disabled:
https://github.com/apache/spark/blob/dc2a4d4ad478fdb0486cc0515d4fe8b402d24db4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala#L76
After SPARK-19678, sizeInBytes is dataSize (397) if
`ENABLE_FALL_BACK_TO_HDFS_FOR_STATS` is enabled; otherwise it is `DEFAULT_SIZE_IN_BYTES`.
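
To make the before/after behavior concrete, here is a minimal, self-contained sketch of the decision described in this comment. It is not Spark's implementation: `SizeEstimateSketch`, `sizeBefore`, and `sizeAfter` are hypothetical names, and the default size is passed in as a plain parameter rather than read from `SQLConf`.

```scala
// Simplified model of the behavior described in the review comment.
// All names here are hypothetical; this is not code from Spark itself.
object SizeEstimateSketch {

  // Before SPARK-19678: the estimate is the on-disk data size,
  // whether or not the HDFS fallback flag is set (the flag is ignored).
  def sizeBefore(dataSize: Long, fallBackToHdfs: Boolean): Long =
    dataSize

  // After SPARK-19678: with incorrect table statistics, the on-disk size is
  // used only when the fallback flag is enabled; otherwise the configured
  // default size is used.
  def sizeAfter(dataSize: Long, fallBackToHdfs: Boolean, defaultSizeInBytes: Long): Long =
    if (fallBackToHdfs) dataSize else defaultSizeInBytes
}
```

With `dataSize = 397L`, `sizeBefore` yields 397 for either flag value, while `sizeAfter` yields 397 only when `fallBackToHdfs` is true and the supplied default otherwise. This is the difference the new test is meant to pin down.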