Re: [PR] OAK-10541 - Pipelined strategy: improve memory management of transform stage [jackrabbit-oak]

via GitHub Wed, 15 Nov 2023 06:49:22 -0800


nfsantos commented on code in PR #1202:
URL: https://github.com/apache/jackrabbit-oak/pull/1202#discussion_r1394308354



##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedSortBatchTask.java:
##########
@@ -116,48 +136,90 @@ public Result call() throws Exception {
         }
     }
 
+    private void buildSortArray(NodeStateEntryBatch nseb) {
+        Stopwatch startTime = Stopwatch.createStarted();
+        ByteBuffer buffer = nseb.getBuffer();
+        int totalPathSize = 0;
+        while (buffer.hasRemaining()) {
+            int positionInBuffer = buffer.position();
+            // Read the next key from the buffer
+            int pathLength = buffer.getInt();
+            totalPathSize += pathLength;
+            if (pathLength > copyBuffer.length) {
+                LOG.debug("Resizing copy buffer from {} to {}", copyBuffer.length, pathLength);
+                copyBuffer = new byte[pathLength];
+            }
+            buffer.get(copyBuffer, 0, pathLength);
+            // Skip the json
+            int entryLength = buffer.getInt();
+            buffer.position(buffer.position() + entryLength);
+            // Create the sort key
+            String path = new String(copyBuffer, 0, pathLength, StandardCharsets.UTF_8);
+            String[] pathSegments = SortKey.genSortKeyPathElements(path);

Review Comment:
   I don't think this would be faster. I tried a similar idea when optimizing 
the merge phase, where lines are read from the intermediate sorted files. 
Reading the path segments directly, one by one, was about as fast as reading 
the full path as a string and then splitting it into segments, even though the 
second option allocates an extra String. I was very surprised. I think the 
reason is that with the path as a String we can use the `String.indexOf` 
method, which is an intrinsic and likely heavily optimized. If we read the path 
segments from the byte buffer, we have to iterate over the buffer byte by byte 
until we find each `/` character, and then build a String for each segment. 
This will very likely be slower than reading the bytes of the whole path from 
the byte buffer in a single operation and then splitting the path string into 
segments. Furthermore, the String holding the path is very short-lived, so the 
JIT is likely applying escape analysis to avoid a heap allocation for it.
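
To make the comparison concrete, here is a rough sketch of the two options. It is not the code from this PR or from the merge phase: the class `PathSplitSketch`, its methods, and the `encode` helper are hypothetical names made up for illustration, and the splitting logic only stands in for `SortKey.genSortKeyPathElements`, whose actual behavior may differ (for example in how it treats empty segments). It assumes the same layout as in the diff: an int length followed by the UTF-8 bytes of the path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Oak.
final class PathSplitSketch {

    // Test helper (assumption): writes [int length][UTF-8 path bytes].
    static ByteBuffer encode(String path) {
        byte[] bytes = path.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + bytes.length);
        buffer.putInt(bytes.length).put(bytes);
        buffer.flip();
        return buffer;
    }

    // Option A: one bulk read, one String, then split with String.indexOf.
    static String[] splitViaString(ByteBuffer buffer, int pathLength, byte[] copyBuffer) {
        buffer.get(copyBuffer, 0, pathLength);
        String path = new String(copyBuffer, 0, pathLength, StandardCharsets.UTF_8);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (start < path.length()) {
            int end = path.indexOf('/', start);
            if (end < 0) {
                end = path.length();
            }
            if (end > start) {
                segments.add(path.substring(start, end));
            }
            start = end + 1;
        }
        return segments.toArray(new String[0]);
    }

    // Option B: scan the buffer byte by byte and decode each segment separately.
    // Scanning for '/' at the byte level is safe because 0x2F never occurs
    // inside a multi-byte UTF-8 sequence.
    static String[] splitViaBytes(ByteBuffer buffer, int pathLength, byte[] copyBuffer) {
        List<String> segments = new ArrayList<>();
        int segmentLength = 0;
        for (int i = 0; i < pathLength; i++) {
            byte b = buffer.get();
            if (b == '/') {
                if (segmentLength > 0) {
                    segments.add(new String(copyBuffer, 0, segmentLength, StandardCharsets.UTF_8));
                    segmentLength = 0;
                }
            } else {
                copyBuffer[segmentLength++] = b;
            }
        }
        if (segmentLength > 0) {
            segments.add(new String(copyBuffer, 0, segmentLength, StandardCharsets.UTF_8));
        }
        return segments.toArray(new String[0]);
    }

    public static void main(String[] args) {
        byte[] copyBuffer = new byte[64];
        ByteBuffer a = encode("/content/dam/asset.jpg");
        System.out.println(Arrays.toString(splitViaString(a, a.getInt(), copyBuffer)));
        ByteBuffer b = encode("/content/dam/asset.jpg");
        System.out.println(Arrays.toString(splitViaBytes(b, b.getInt(), copyBuffer)));
    }
}
```

Both methods should return the same segments for the same input. Which one is faster would have to be measured, for example with a JMH benchmark, since the sketch says nothing about performance on its own.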



