alexeykudinkin commented on a change in pull request #4026:
URL: https://github.com/apache/hudi/pull/4026#discussion_r756314329



##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##########
@@ -160,41 +160,92 @@ case class HoodieFileIndex(
       spark.sessionState.conf.getConfString(DataSourceReadOptions.ENABLE_DATA_SKIPPING.key(), "false")).toBoolean
   }
 
-  private def filterFilesByDataSkippingIndex(dataFilters: Seq[Expression]): Set[String] = {
-    var allFiles: Set[String] = Set.empty
-    var candidateFiles: Set[String] = Set.empty
+  /**
+   * Computes the pruned list of candidate base-files' names based on the provided list of {@link dataFilters}
+   * conditions, by leveraging a custom Z-order index (Z-index) bearing "min", "max", and "num_nulls" statistics
+   * for all clustered columns.
+   *
+   * NOTE: This method has to return a complete set of candidate files, since only the provided candidates will
+   *       ultimately be scanned as part of query execution. Hence, this method has to maintain the
+   *       invariant of conservatively including every base-file's name that is NOT referenced in its index.
+   *
+   * @param dataFilters list of original data filters passed down from the querying engine
+   * @return list of pruned (data-skipped) candidate base-files' names
+   */
+  private def lookupCandidateFilesNamesInZIndex(dataFilters: Seq[Expression]): Option[Set[String]] = {

Review comment:
       Are you suggesting we drop "Z-index" from the name? Here we actually read from the Z-index (whose folder is also named `.zindex`), so we'd have to neutralize that naming across the board.
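
The pruning invariant stated in the Javadoc above (every base-file NOT referenced in the index must be conservatively retained as a candidate) can be sketched as follows. This is an illustrative sketch only: `ColumnStats`, `lookupCandidates`, and the equality predicate are hypothetical names, not Hudi's actual API.

```scala
// Hypothetical per-file statistics, mirroring the "min"/"max"/"num_nulls"
// columns the Z-index stores for each clustered column.
case class ColumnStats(min: Int, max: Int, numNulls: Long)

// Conservative candidate lookup: a file is pruned ONLY when the index proves
// it cannot match; files absent from the index are always kept.
def lookupCandidates(allFiles: Set[String],
                     index: Map[String, ColumnStats],
                     mayMatch: ColumnStats => Boolean): Set[String] =
  allFiles.filter { file =>
    index.get(file) match {
      case Some(stats) => mayMatch(stats) // indexed: keep if it MAY match
      case None        => true            // not in index: keep conservatively
    }
  }

val files = Set("f1", "f2", "f3")
val index = Map(
  "f1" -> ColumnStats(0, 10, 0),  // range excludes 42 -> prunable
  "f2" -> ColumnStats(40, 50, 0)  // range may contain 42 -> keep
)                                  // "f3" is unindexed -> must be kept

// Filter "col = 42": a file may match only if min <= 42 <= max.
val candidates = lookupCandidates(files, index, s => s.min <= 42 && s.max >= 42)
println(candidates.toList.sorted.mkString(","))  // prints "f2,f3"
```

Note that `f3` survives pruning despite having no index entry; dropping it would violate the completeness invariant, since only returned candidates are scanned during query execution.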
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

