nfsantos commented on code in PR #1249:
URL: https://github.com/apache/jackrabbit-oak/pull/1249#discussion_r1426731651
##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedMongoDownloadTask.java:
##########
@@ -353,63 +354,101 @@ private void downloadWithNaturalOrdering() throws InterruptedException, TimeoutE
}
}
- private String getPathForRegexFiltering() {
+ private Set<String> getPathsForRegexFiltering() {
if (!regexPathFiltering) {
LOG.info("Regex path filtering disabled.");
- return null;
+ return Set.of();
}
- return getSingleIncludedPath(pathFilters);
+ return extractIncludedPaths(pathFilters);
}
+ /**
+ * Aggregates the included paths from the path filters. The final list will not contain duplicates or overlapping
+ * paths (e.g., /a and /a/b).
+ *
+ * @param pathFilters Empty set if path filtering should be disabled, otherwise the paths that should be included
+ * in the Mongo query filters
+ */
// Package private for testing
- static String getSingleIncludedPath(List<PathFilter> pathFilters) {
- // For the time being, we only enable path filtering if there is a single include path across all indexes and no
- // exclude paths. This is the case for most of the larger indexes. We can consider generalizing this in the future.
- LOG.info("Creating regex filter from pathFilters: " + pathFilters);
+ static Set<String> extractIncludedPaths(List<PathFilter> pathFilters) {
+ // Path filtering is enabled only if there are no excludedPaths.
if (pathFilters == null) {
- return null;
+ return Set.of();
+ }
+ Set<String> includedPaths = new HashSet<>();
+ Set<String> excludedPaths = new HashSet<>();
+ for (PathFilter pathFilter : pathFilters) {
+ includedPaths.addAll(pathFilter.getIncludedPaths());
+ excludedPaths.addAll(pathFilter.getExcludedPaths());
+ }
+ // Sort lexicographically so that ancestors come before their descendants, which makes it easier to compute the common ancestors of the include paths
+ List<String> sortedIncludedPaths = includedPaths.stream()
+ .sorted()
+ .collect(Collectors.toList());
+
+ LOG.info("Paths considered for regex filtering. IncludedPaths: {}, excludedPaths: {}", sortedIncludedPaths, excludedPaths);
+ if (sortedIncludedPaths.contains("/")) {
+ LOG.info("Disabling regex path filtering because root path is in the includedPaths: {}", sortedIncludedPaths);
+ return Set.of();
+ }
+ if (!excludedPaths.isEmpty()) {
+ LOG.info("Disabling regex path filtering because there are excluded paths: {}", excludedPaths);
+ return Set.of();
}
- Set<String> includedPaths = pathFilters.stream()
- .flatMap(pathFilter -> pathFilter.getIncludedPaths().stream())
- .collect(Collectors.toSet());
-
- Set<String> excludedPaths = pathFilters.stream()
- .flatMap(pathFilter -> pathFilter.getExcludedPaths().stream())
- .collect(Collectors.toSet());
- if (excludedPaths.isEmpty() && includedPaths.size() == 1) {
- return includedPaths.stream().iterator().next();
- } else {
- return null;
+ // Keep only the topmost include paths. That is, if paths "/a/b" and "/a/b/c" are both in the list, keep only "/a/b"
+ HashSet<String> includedPathsRoots = new HashSet<>();
+ for (String path : sortedIncludedPaths) {
+ if (includedPathsRoots.stream().noneMatch(ancestor -> PathUtils.isAncestor(ancestor, path))) {
+ includedPathsRoots.add(path);
+ }
}
+ return includedPathsRoots;
}
- private static Bson descendantsFilter(String path) {
+ private static Bson descendantsFilter(Set<String> paths) {
+ List<Bson> filters = paths.stream()
+ .flatMap(PipelinedMongoDownloadTask::descendantsFilter)
+ .collect(Collectors.toList());
+ return Filters.or(filters);
+ }
+
+ private static Stream<Bson> descendantsFilter(String path) {
if (!path.endsWith("/")) {
path = path + "/";
}
String quotedPath = Pattern.quote(path);
// For entries with path sizes above a certain threshold, the _id field contains a hash instead of the path of
// the entry. The path is stored instead in the _path field. Therefore, we have to include in the filter also
// the documents with matching _path.
- return Filters.or(
+ return Stream.of(
regex(NodeDocument.ID, Pattern.compile("^[0-9]{1,3}:" + quotedPath + ".*$")),
Review Comment:
Can you elaborate?
From what I understand, the document id will be in one of these formats:
- depth + ":" + path (the last line of the method that you refer to)
- a hash, as returned by `createHashedId(depth, hash, path.getName())`
This is the full line that you quoted:
```
return Stream.of(
regex(NodeDocument.ID, Pattern.compile("^[0-9]{1,3}:" + quotedPath + ".*$")),
regex(NodeDocument.PATH, Pattern.compile(quotedPath + ".*$")
));
```
If the id is in the first format, the regex filter on the `_id` will pick it up. If the id is a hash, then the second regex on `_path` will match the document.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]