xiarixiaoyao commented on a change in pull request #4013:
URL: https://github.com/apache/hudi/pull/4013#discussion_r754048363



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##########
@@ -142,6 +142,16 @@
       .sinceVersion("0.9.0")
       .withDocumentation("When rewriting data, preserves existing hoodie_commit_time");
 
+  /**
+   * Using space-filling curves to optimize the layout of table to boost query performance.
+   * The table data which sorted by space-filling curve has better aggregation; combine with min-max filtering, it can achieve good performance improvement.

Review comment:
       ok
   

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
##########
@@ -283,17 +283,37 @@ public Boolean apply(String recordKey) {
 
   /**
    * Parse min/max statistics stored in parquet footers for all columns.
+   * ParquetRead.readFooter is not a thread safe method.
+   *
+   * @param conf hadoop conf.
+   * @param parquetFilePath file to be read.
+   * @param cols cols which need to collect statistics.
+   * @param useLock if use lock when read parquet footer.
+   * @return a HoodieColumnRangeMetadata instance.
    */
-  public Collection<HoodieColumnRangeMetadata<Comparable>> readRangeFromParquetMetadata(Configuration conf, Path parquetFilePath, List<String> cols) {
-    ParquetMetadata metadata = readMetadata(conf, parquetFilePath);
+  public Collection<HoodieColumnRangeMetadata<Comparable>> readRangeFromParquetMetadata(
+      Configuration conf,
+      Path parquetFilePath,
+      List<String> cols,
+      boolean useLock) {

Review comment:
       We get the statistics for the specified columns by reading the footer.
   
   However, if a specified column is of date type, parquet uses 
SimpleDateFormat (which is not thread safe) to format the data when returning 
the min/max values. With multiple threads, this can lead to wrong min/max 
values being returned.
   
   Let me add a test case.
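
   To make the concern concrete, here is a minimal standalone sketch (not the 
code in this PR). It assumes a single SimpleDateFormat instance shared across 
threads, which stands in for the formatter used while returning date min/max 
values, and it guards that instance with a lock when a useLock flag is set. 
The class name LockedDateFormatSketch and the helpers formatDay and 
countMismatches are hypothetical names used only for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

public class LockedDateFormatSketch {
  private static final long MILLIS_PER_DAY = 86_400_000L;

  // Assumption: one shared formatter, standing in for the non-thread-safe
  // formatter used when date statistics are turned into values.
  private static final SimpleDateFormat SHARED_FORMAT = newUtcFormat();
  private static final ReentrantLock LOCK = new ReentrantLock();

  private static SimpleDateFormat newUtcFormat() {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    return format;
  }

  // Formats a day (days since 1970-01-01) with the shared formatter,
  // optionally serializing access through the lock.
  static String formatDay(long epochDay, boolean useLock) {
    Date date = new Date(epochDay * MILLIS_PER_DAY);
    if (!useLock) {
      return SHARED_FORMAT.format(date);
    }
    LOCK.lock();
    try {
      return SHARED_FORMAT.format(date);
    } finally {
      LOCK.unlock();
    }
  }

  // Runs several threads against the shared formatter and counts results
  // that differ from a thread-confined reference formatter.
  static int countMismatches(int threads, int daysPerThread, boolean useLock)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final long offset = (long) t * daysPerThread;
        Callable<Integer> task = () -> {
          SimpleDateFormat reference = newUtcFormat(); // never shared
          int bad = 0;
          for (int d = 0; d < daysPerThread; d++) {
            long day = offset + d;
            String expected = reference.format(new Date(day * MILLIS_PER_DAY));
            try {
              if (!expected.equals(formatDay(day, useLock))) {
                bad++;
              }
            } catch (RuntimeException e) {
              bad++; // unsynchronized use may also throw
            }
          }
          return bad;
        };
        futures.add(pool.submit(task));
      }
      int total = 0;
      for (Future<Integer> future : futures) {
        total += future.get();
      }
      return total;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("mismatches with lock: " + countMismatches(8, 2000, true));
    System.out.println("mismatches without lock: " + countMismatches(8, 2000, false));
  }
}
```

   With the lock, the mismatch count is always 0. Without it, the count 
depends on thread timing, so it may or may not be non-zero on a given run. 
That nondeterminism is why a test for the locked path is easier to make 
reliable than one that tries to reproduce the failure.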




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


