yihua commented on code in PR #12323:
URL: https://github.com/apache/hudi/pull/12323#discussion_r1857401875
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##########
@@ -443,12 +443,6 @@ private Dataset<Row>
readRecordsForGroupAsRow(JavaSparkContext jsc,
.toArray(StoragePath[]::new);
HashMap<String, String> params = new HashMap<>();
- if (hasLogFiles) {
- params.put("hoodie.datasource.query.type", "snapshot");
- } else {
- params.put("hoodie.datasource.query.type", "read_optimized");
- }
Review Comment:
The config itself has a default value defined, but as you pointed out, it is not honored through the read path when file paths are provided (which the clustering execution uses).
```
val QUERY_TYPE: ConfigProperty[String] = ConfigProperty
  .key("hoodie.datasource.query.type")
  .defaultValue(QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .withAlternatives("hoodie.datasource.view.type")
  .withValidValues(QUERY_TYPE_SNAPSHOT_OPT_VAL, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .withDocumentation("Whether data needs to be read, in `" + QUERY_TYPE_INCREMENTAL_OPT_VAL +
    "` mode (new data since an instantTime) (or) `" + QUERY_TYPE_READ_OPTIMIZED_OPT_VAL +
    "` mode (obtain latest view, based on base files) (or) `" + QUERY_TYPE_SNAPSHOT_OPT_VAL +
    "` mode (obtain latest view, by merging base and (if any) log files)")
```
Also, after checking the code again, when the file paths are provided through `hoodie.datasource.read.paths`, the relation-based read path is used (i.e., `useNewParquetFileFormat` is `false`). We can keep this for now. Filed HUDI-8576 as a follow-up.
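For reference, the removed branching can be sketched as a small standalone helper (hypothetical class and method names, not part of the Hudi API), showing how the query type was chosen per clustering group:

```java
import java.util.HashMap;
import java.util.Map;

public class QueryTypeParams {
  // Hypothetical sketch of the removed branching: pick the Hudi query type
  // based on whether the clustering file group contains log files.
  static Map<String, String> queryTypeParams(boolean hasLogFiles) {
    Map<String, String> params = new HashMap<>();
    // Log files present -> need merged "snapshot" view; otherwise the
    // base-file-only "read_optimized" view suffices.
    params.put("hoodie.datasource.query.type",
        hasLogFiles ? "snapshot" : "read_optimized");
    return params;
  }

  public static void main(String[] args) {
    System.out.println(queryTypeParams(true).get("hoodie.datasource.query.type"));
    System.out.println(queryTypeParams(false).get("hoodie.datasource.query.type"));
  }
}
```

Since the default value of `hoodie.datasource.query.type` is `snapshot`, dropping the branch changes behavior only for the base-file-only case, which is what the follow-up tracks.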