[GitHub] [hudi] codope commented on a diff in pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark

via GitHub Mon, 03 Apr 2023 22:32:50 -0700


codope commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1156740671



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -270,6 +271,21 @@ object DefaultSource {
     }
   }
 
+  private def resolveHoodieBootstrapRelation(sqlContext: SQLContext,
+                                             globPaths: Seq[Path],
+                                             userSchema: Option[StructType],
+                                             metaClient: HoodieTableMetaClient,
+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {

Review Comment:
   I think we should drop the config and rely on the condition here to decide 
whether to use the fast read path (which should be the default). WDYT?
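
   A minimal sketch of what the config-free decision might look like, assuming 
only the file-index flag and the glob paths remain as inputs. The object and 
method names are hypothetical and the paths are modelled as plain strings 
rather than Hadoop `Path`s:

```scala
// Hypothetical stand-in for the branch in resolveHoodieBootstrapRelation.
// Assumption: without the DATA_QUERIES_ONLY config, only these two inputs decide the path.
object FastBootstrapReadDecision {

  /** Returns true when the fast read path would be taken. */
  def useFastReadPath(enableFileIndex: Boolean, globPaths: Seq[String]): Boolean =
    enableFileIndex && globPaths.isEmpty
}
```

   With this shape the fast path is the default, and callers fall back to the 
regular bootstrap relation only when the file index is disabled or glob paths 
are supplied.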



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala:
##########
@@ -807,7 +807,9 @@ class TestHoodieSparkSqlWriter {
         .option("hoodie.insert.shuffle.parallelism", "4")
         .mode(SaveMode.Append).save(tempBasePath)
 
-      val currentCommits = spark.read.format("hudi").load(tempBasePath).select("_hoodie_commit_time").take(1).map(_.getString(0))
+      val currentCommits = spark.read.format("hudi")
+        .option(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key, "false")

Review Comment:
   Need more tests. Setting it to `false` does not test the changed code path.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -180,6 +180,22 @@ case class HoodieFileIndex(spark: SparkSession,
     }
   }
 
+  /**
+   * In the fast bootstrap read code path, it gets the file status for the bootstrap base files instead of
+   * skeleton files.
+   */
+  private def getBaseFileStatus(baseFiles: mutable.Buffer[HoodieBaseFile]): mutable.Buffer[FileStatus] = {
+    if (shouldFastBootstrap) {
+     return baseFiles.map(f =>
+        if (f.getBootstrapBaseFile.isPresent) {
+         f.getBootstrapBaseFile.get().getFileStatus

Review Comment:
   Why do we need to guard this with the `shouldFastBootstrap` conditional? 
Shouldn't we always return the source file status if it's present?
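
   For illustration, an unguarded version could be sketched as below. The 
types are simplified, hypothetical stand-ins (the real code uses 
`HoodieBaseFile`, Hadoop's `FileStatus`, and Hudi's own `Option` with 
`isPresent`/`get`); a Scala `Option` is assumed here to keep the sketch 
self-contained:

```scala
// Hypothetical simplified types, not the Hudi or Hadoop classes.
final case class FileStatus(path: String)
final case class BaseFile(fileStatus: FileStatus, bootstrapBaseFile: Option[BaseFile] = None)

object BaseFileStatusSketch {

  /** Prefers the bootstrap (source) file status when present, else the file's own status. */
  def getBaseFileStatus(baseFiles: Seq[BaseFile]): Seq[FileStatus] =
    baseFiles.map(f => f.bootstrapBaseFile.map(_.fileStatus).getOrElse(f.fileStatus))
}
```

   Whether dropping the guard is safe depends on how the non-fast path consumes 
these statuses, which this sketch does not model.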



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -83,10 +83,18 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
   /**
    * Get the schema of the table.
    */
-  lazy val schema: StructType = schemaSpec.getOrElse({
-    val schemaUtil = new TableSchemaResolver(metaClient)
-    AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
-  })
+  lazy val schema: StructType = if (shouldFastBootstrap) {
+      StructType(rawSchema.fields.filterNot(f => HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name)))

Review Comment:
   Just import the static member `HOODIE_META_COLUMNS_WITH_OPERATION` instead 
of importing the full `HoodieRecord`.
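
   A rough sketch of the member import, using a hypothetical stand-in object 
so it runs without Hudi on the classpath. The column names listed are an 
assumption about what the real set contains, and field names are modelled as 
plain strings rather than a Spark `StructType`:

```scala
// Hypothetical stand-in for HoodieRecord; the listed names are assumed, not taken from the PR.
object HoodieRecordStandIn {
  val HOODIE_META_COLUMNS_WITH_OPERATION: Set[String] = Set(
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name", "_hoodie_operation")
}

object SchemaSketch {
  // Import only the member that is needed, not the enclosing object.
  import HoodieRecordStandIn.HOODIE_META_COLUMNS_WITH_OPERATION

  /** Drops meta columns, mirroring the filterNot in the diff above. */
  def dataFieldNames(fieldNames: Seq[String]): Seq[String] =
    fieldNames.filterNot(HOODIE_META_COLUMNS_WITH_OPERATION.contains)
}
```

   In the actual change the import would target the member on `HoodieRecord` 
itself, so the call site can refer to the constant without the class prefix.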



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

