Re: [PR] OAK-11114 - Filter downloaded Mongo documents by path suffix [jackrabbit-oak]

via GitHub Mon, 16 Sep 2024 05:16:38 -0700


nfsantos commented on code in PR #1716:
URL: https://github.com/apache/jackrabbit-oak/pull/1716#discussion_r1761036072



##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/NodeDocumentCodec.java:
##########
@@ -49,44 +56,106 @@
  *   <li>Allows estimating the size of the document while reading it, which will have a negligible overhead (as compared
  *   with doing an additional traverse of the object structure to compute the size).</li>
  * </ul>
- *
+ * <p>
  * This class must be thread-safe, Mongo uses a single coded implementation across multiple threads.
- *
  */
 public class NodeDocumentCodec implements Codec<NodeDocument> {
+    private final static Logger LOG = LoggerFactory.getLogger(NodeDocumentCodec.class);
+
+    public static final String OAK_INDEXER_PIPELINED_NODE_DOCUMENT_FILTER_FILTERED_PATH = "oak.indexer.pipelined.nodeDocument.filter.filteredPath";
+    public static final String OAK_INDEXER_PIPELINED_NODE_DOCUMENT_FILTER_SUFFIXES_TO_SKIP = "oak.indexer.pipelined.nodeDocument.filter.suffixesToSkip";
+    private final String filteredPath = ConfigHelper.getSystemPropertyAsString(OAK_INDEXER_PIPELINED_NODE_DOCUMENT_FILTER_FILTERED_PATH, "");
+    private final List<String> suffixesToSkip = ConfigHelper.getSystemPropertyAsStringList(OAK_INDEXER_PIPELINED_NODE_DOCUMENT_FILTER_SUFFIXES_TO_SKIP, "", ';');
+
     // The estimated size is stored in the NodeDocument itself
     public final static String SIZE_FIELD = "_ESTIMATED_SIZE_";
+
+    private static class NodeDocumentDecoderContext {
+        long docsDecoded = 0;
+        long dataDownloaded = 0;
+        int estimatedSizeOfCurrentObject = 0;
+    }
+
+    private final NodeDocument emptyNodeDocument;
+
     private final MongoDocumentStore store;
     private final Collection<NodeDocument> collection;
     private final BsonTypeCodecMap bsonTypeCodecMap;
     private final DecoderContext decoderContext = DecoderContext.builder().build();
-
     private final Codec<String> stringCoded;
     private final Codec<Long> longCoded;
     private final Codec<Boolean> booleanCoded;
 
+    private final NodeDocumentFilter fieldFilter = new NodeDocumentFilter(filteredPath, suffixesToSkip);
+
+    // Statistics
+    private final AtomicLong totalDocsDecoded = new AtomicLong(0);
+    private final AtomicLong totalDataDownloaded = new AtomicLong(0);
+    private final ThreadLocal<NodeDocumentDecoderContext> perThreadContext = ThreadLocal.withInitial(NodeDocumentDecoderContext::new);
+
     public NodeDocumentCodec(MongoDocumentStore store, Collection<NodeDocument> collection, CodecRegistry defaultRegistry) {
         this.store = store;
         this.collection = collection;
         this.bsonTypeCodecMap = new BsonTypeCodecMap(new BsonTypeClassMap(), defaultRegistry);
+        this.emptyNodeDocument = collection.newDocument(store);
         // Retrieve references to the most commonly used codecs, to avoid the map lookup in the common case
         this.stringCoded = (Codec<String>) bsonTypeCodecMap.get(BsonType.STRING);
         this.longCoded = (Codec<Long>) bsonTypeCodecMap.get(BsonType.INT64);
         this.booleanCoded = (Codec<Boolean>) bsonTypeCodecMap.get(BsonType.BOOLEAN);
     }
 
+    /**
+     * Skipping over values in the BSON file is faster than reading them. Skipping is done by advancing a pointer in
+     * an internal buffer, while reading requires converting them to a Java data type (typically String).
+     */
+    private void skipUntilEndOfDocument(BsonReader reader) {
+        while (reader.readBsonType() != BsonType.END_OF_DOCUMENT) {
+            reader.skipName();
+            reader.skipValue();
+        }
+        reader.readEndDocument();
+    }
+
     @Override
     public NodeDocument decode(BsonReader reader, DecoderContext decoderContext) {
         NodeDocument nodeDocument = collection.newDocument(store);
-        MutableInt estimatedSizeOfCurrentObject = new MutableInt(0);
+        NodeDocumentDecoderContext threadLocalContext = perThreadContext.get();
+        threadLocalContext.estimatedSizeOfCurrentObject = 0;
         reader.readStartDocument();
         while (reader.readBsonType() != BsonType.END_OF_DOCUMENT) {
             String fieldName = reader.readName();
-            Object value = readValue(reader, fieldName, estimatedSizeOfCurrentObject);
+            Object value = readValue(reader, fieldName, threadLocalContext);
+            // Once we read the _id or the _path, apply the filter
+            if (fieldName.equals(NodeDocument.ID) || fieldName.equals(NodeDocument.PATH)) {
+                if (fieldFilter.shouldSkip(fieldName, (String) value)) {

Review Comment:
   There shouldn't be, but it's a good point. The filtering is a best-effort 
performance optimization that is not needed for correctness, so it should never 
cause the download to fail. I added a type check on `value` (which also covers 
the null case) before calling the filter method.
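
For readers following the thread, here is a minimal sketch of the kind of guard described above. It is not the code that was committed. `SuffixFilterSketch`, its matching rule and `shouldSkipSafely` are hypothetical stand-ins, because the real `NodeDocumentFilter` logic is not shown in this hunk. The sketch relies only on the fact that `instanceof` evaluates to `false` for `null`, so one check handles both the wrong-type and the null case.

```java
import java.util.List;

// Hypothetical stand-in for the PR's NodeDocumentFilter. The matching rule
// below (path contains filteredPath and ends with a configured suffix) is an
// assumption made for illustration only.
final class SuffixFilterSketch {
    private final String filteredPath;
    private final List<String> suffixesToSkip;

    SuffixFilterSketch(String filteredPath, List<String> suffixesToSkip) {
        this.filteredPath = filteredPath;
        this.suffixesToSkip = suffixesToSkip;
    }

    // Assumed contract: the caller guarantees idOrPath is a non-null String.
    boolean shouldSkip(String fieldName, String idOrPath) {
        if (filteredPath.isEmpty() || suffixesToSkip.isEmpty()) {
            return false; // filtering disabled
        }
        if (!idOrPath.contains(filteredPath)) {
            return false;
        }
        for (String suffix : suffixesToSkip) {
            if (idOrPath.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // The guard: only call the filter when the decoded value is a String.
    // `value instanceof String` is false for null, so no separate null check
    // is needed, and an unexpected type simply means "do not skip".
    static boolean shouldSkipSafely(SuffixFilterSketch filter, String fieldName, Object value) {
        return value instanceof String && filter.shouldSkip(fieldName, (String) value);
    }

    public static void main(String[] args) {
        SuffixFilterSketch filter = new SuffixFilterSketch("/content/dam", List.of("/metadata"));
        System.out.println(shouldSkipSafely(filter, "_id", "3:/content/dam/a/metadata"));
        System.out.println(shouldSkipSafely(filter, "_id", null));
        System.out.println(shouldSkipSafely(filter, "_id", 42L));
    }
}
```

Falling back to "do not skip" on an unexpected value keeps the filter a pure optimization: in the worst case a document is downloaded that could have been dropped, and the download itself never fails because of the filter.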
   



