Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19479#discussion_r149851930
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1024,21 +1024,36 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
      stats: CatalogStatistics,
      schema: StructType): Map[String, String] = {
-    var statsProperties: Map[String, String] =
-      Map(STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString())
+    val statsProperties = new mutable.HashMap[String, String]()
+    statsProperties += STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString()
     if (stats.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.rowCount.get.toString()
     }
+    // In Hive metastore, the length of value in table properties cannot be larger than 4000.
+    // We need to split the key-value pair into multiple key-value properties if the length of
+    // value exceeds this threshold.
+    val threshold = conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)
--- End diff --
Do we still need this hack? I don't think the histogram string can hit this limitation. Creating too many buckets is nonsense.
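For context, a rough sketch of the kind of splitting being questioned here, written against illustrative names only (splitLargeTableProp and the ".numParts"/".part.N" key suffixes are assumptions, not Spark's actual helpers): a value longer than the threshold is stored as numbered part properties plus a part count, so each individual metastore property stays under the limit.

import scala.collection.mutable

object PropSplitSketch {
  // Hypothetical helper: if `value` fits under `threshold`, store it as a
  // single property; otherwise chunk it into fixed-size parts and record
  // the part count so readers can reassemble the original string.
  def splitLargeTableProp(
      key: String,
      value: String,
      threshold: Int): Map[String, String] = {
    if (value.length <= threshold) {
      Map(key -> value)
    } else {
      val parts = value.grouped(threshold).toSeq
      val props = new mutable.HashMap[String, String]()
      props += s"${key}.numParts" -> parts.length.toString
      parts.zipWithIndex.foreach { case (part, i) =>
        props += s"${key}.part.$i" -> part
      }
      props.toMap
    }
  }
}

Whether a histogram string can ever grow past the 4000-character metastore limit, and so whether this extra machinery is worth carrying, is exactly the question raised above.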
---