pengzhiwei2018 commented on a change in pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#discussion_r581588514



##########
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -377,11 +388,71 @@ private[hudi] object HoodieSparkSqlWriter {
     hiveSyncConfig.supportTimestamp = 
parameters.get(HIVE_SUPPORT_TIMESTAMP).exists(r => r.toBoolean)
     hiveSyncConfig.decodePartition = 
parameters.getOrElse(URL_ENCODE_PARTITIONING_OPT_KEY,
       DEFAULT_URL_ENCODE_PARTITIONING_OPT_VAL).toBoolean
+    hiveSyncConfig.tableProperties = 
parameters.getOrElse(HIVE_TABLE_PROPERTIES, null)
+    hiveSyncConfig.serdeProperties = createSqlTableSerdeProperties(parameters, 
basePath.toString,
+      hiveSyncConfig.partitionFields.size())
     hiveSyncConfig
   }
 
-  private def metaSync(parameters: Map[String, String],
-                       basePath: Path,
+  /**
+    * Add Spark Sql related table properties to the HIVE_TABLE_PROPERTIES.
+    * @param sqlConf
+    * @param schema
+    * @param parameters
+    * @return A new parameters added the HIVE_TABLE_PROPERTIES property.
+    */
+  private def addSqlTableProperties(sqlConf: SQLConf, schema: StructType,
+                                    parameters: Map[String, String]): 
Map[String, String] = {
+    val partitionSet = parameters(HIVE_PARTITION_FIELDS_OPT_KEY)
+      .split(",").map(_.trim).filter(!_.isEmpty).toSet
+    val threshold = sqlConf.getConf(SCHEMA_STRING_LENGTH_THRESHOLD)
+
+    val (partCols, dataCols) = schema.partition(c => 
partitionSet.contains(c.name))
+    val reOrdered = StructType(dataCols ++ partCols)
+    val parts = reOrdered.json.grouped(threshold).toSeq
+
+    var properties = Map(
+      "spark.sql.sources.provider" -> "hudi",
+      "spark.sql.sources.schema.numParts" -> parts.size.toString
+    )
+    parts.zipWithIndex.foreach { case (part, index) =>
+      properties += s"spark.sql.sources.schema.part.$index" -> part
+    }
+    // add partition columns
+    if (partitionSet.nonEmpty) {
+      properties += "spark.sql.sources.schema.numPartCols" -> 
partitionSet.size.toString
+      partitionSet.zipWithIndex.foreach { case (partCol, index) =>
+        properties += s"spark.sql.sources.schema.partCol.$index" -> partCol
+      }
+    }
+    var sqlPropertyText = ConfigUtils.configToString(properties)
+    sqlPropertyText = if (parameters.containsKey(HIVE_TABLE_PROPERTIES)) {
+      sqlPropertyText + "\n" + parameters(HIVE_TABLE_PROPERTIES)
+    } else {
+      sqlPropertyText
+    }
+    parameters + (HIVE_TABLE_PROPERTIES -> sqlPropertyText)
+  }
+
+  private def createSqlTableSerdeProperties(parameters: Map[String, String],
+                                            basePath: String, pathDepth: Int): 
String = {
+    assert(pathDepth >= 0, "Path Depth must great or equal to 0")
+    var pathProp = s"path=$basePath"
+    if (pathProp.endsWith("/")) {
+      pathProp = pathProp.substring(0, pathProp.length - 1)
+    }
+    for (_ <- 0 until pathDepth + 1) {
+      pathProp = s"$pathProp/*"

Review comment:
       Hi @nsivabalan @vinothchandar @n3nash
   Currently I append "*" to the path for Spark because Hudi needs the "*" when 
querying a Hudi table. The number of "*" is the same as the number of partition 
path fields.
   However, there are some problems when we save the path with "*" to the meta 
store.
   1. If we have not enabled encoding for partition values (via 
`URL_ENCODE_PARTITIONING_OPT_KEY`), the number of partition fields may not be 
equal to the number of partition directory levels actually generated (e.g. for 
pt = "2021/02/03", the number of partition fields is 1, but the generated 
partition directory has 3 levels). In this case, we cannot know at compile time 
how many "*" to append to the path for Spark.
   2. Problems for Spark SQL on Hudi.
   If we save a path containing "*" to the metastore, we will have problems 
with the `Insert` and `Refresh Table` commands, because `Insert` and 
`Refresh Table` do not allow "*" in the path.
   
   So I think we need to support no-stars queries for Hudi tables. Once no-stars 
queries are supported, we do not need to append "*" to the path, and all the 
problems above will be solved.
   I have filed an issue to track this: 
[HUDI-1591](https://issues.apache.org/jira/browse/HUDI-1591)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to