jonvex commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1165555978


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -270,6 +271,21 @@ object DefaultSource {
     }
   }
 
+  private def resolveHoodieBootstrapRelation(sqlContext: SQLContext,
+                                             globPaths: Seq[Path],
+                                             userSchema: Option[StructType],
+                                             metaClient: HoodieTableMetaClient,
+                                             parameters: Map[String, String]): BaseRelation = {
+    val enableFileIndex = HoodieSparkConfUtils.getConfigValue(parameters, sqlContext.sparkSession.sessionState.conf,
+      ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean
+    if (!enableFileIndex || globPaths.nonEmpty || parameters.getOrElse(HoodieBootstrapConfig.DATA_QUERIES_ONLY.key(), "true") != "true") {

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-3896 is one of the optimizations;
I am not sure whether it is the only one. The query plans for non-bootstrapped
and bootstrapped tables look nearly identical, except that the non-bootstrapped
read shows "FileScan parquet" while the bootstrapped read shows "scan
HoodieBootstrapRelation".
   
   I started by comparing the time to run TPC-DS queries on bootstrapped tables
against non-bootstrapped tables. The runtime ratio (bootstrapped to
non-bootstrapped) was 1.997 for a full bootstrap and 1.638 for a metadata-only
bootstrap.
   
   I was surprised that the full bootstrap was so slow, so I tried to replicate
what BaseFileOnlyRelation does in the first commit of
[this PR](https://github.com/apache/hudi/pull/8272), where we create a
HoodieFileScanRDD instead of a HoodieBootstrapRDD. With that change, the TPC-DS
runtime ratio relative to a non-bootstrapped table was 1.48 for a full
bootstrap and 1.35 for a metadata-only bootstrap.
   
   With the changes in this PR to leverage HadoopFsRelation, the ratio was 1.12
for a metadata-only bootstrap and 1.09 for a full bootstrap.
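
To make the ratios easier to compare, here is a rough sketch that converts them into percentage overhead relative to a non-bootstrapped table. It assumes overhead is defined as (ratio - 1) * 100; the object name, method name, and row labels are hypothetical and used only for illustration. Only the ratios themselves come from the measurements above.

```scala
object BootstrapOverhead {
  // Assumed definition: extra runtime relative to a non-bootstrapped table, in percent.
  def overheadPct(ratio: Double): Double = (ratio - 1.0) * 100.0

  // (read path label, full bootstrap ratio, metadata-only bootstrap ratio).
  // The labels are illustrative; the ratios are the ones reported in this comment.
  val measurements: Seq[(String, Double, Double)] = Seq(
    ("baseline bootstrap read", 1.997, 1.638),
    ("HoodieFileScanRDD",       1.48,  1.35),
    ("HadoopFsRelation",        1.09,  1.12)
  )

  def main(args: Array[String]): Unit = {
    println(f"${"read path"}%-26s ${"full"}%8s ${"metadata-only"}%14s")
    measurements.foreach { case (label, full, metadataOnly) =>
      // %% prints a literal percent sign in the f interpolator.
      println(f"$label%-26s ${overheadPct(full)}%7.1f%% ${overheadPct(metadataOnly)}%13.1f%%")
    }
  }
}
```

Read this way, the full bootstrap goes from roughly doubling the runtime to adding under a tenth on top of the non-bootstrapped runtime.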


