nfsantos commented on code in PR #1202: URL: https://github.com/apache/jackrabbit-oak/pull/1202#discussion_r1394308354
##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedSortBatchTask.java:
##########
@@ -116,48 +136,90 @@ public Result call() throws Exception {
         }
     }
 
+    private void buildSortArray(NodeStateEntryBatch nseb) {
+        Stopwatch startTime = Stopwatch.createStarted();
+        ByteBuffer buffer = nseb.getBuffer();
+        int totalPathSize = 0;
+        while (buffer.hasRemaining()) {
+            int positionInBuffer = buffer.position();
+            // Read the next key from the buffer
+            int pathLength = buffer.getInt();
+            totalPathSize += pathLength;
+            if (pathLength > copyBuffer.length) {
+                LOG.debug("Resizing copy buffer from {} to {}", copyBuffer.length, pathLength);
+                copyBuffer = new byte[pathLength];
+            }
+            buffer.get(copyBuffer, 0, pathLength);
+            // Skip the json
+            int entryLength = buffer.getInt();
+            buffer.position(buffer.position() + entryLength);
+            // Create the sort key
+            String path = new String(copyBuffer, 0, pathLength, StandardCharsets.UTF_8);
+            String[] pathSegments = SortKey.genSortKeyPathElements(path);

Review Comment:
I don't think this would be faster. I tried a similar idea when optimizing the merge phase, when reading lines from the intermediate sorted files. Reading the segments of the path directly, one by one, was about as fast as reading the full path as a String and then breaking it into segments, even though the second option allocates an extra String. I was very surprised. I think the reason is that with the path as a String, we can use the `String.indexOf` method, which is an intrinsic and likely heavily optimized. If we read the path segments from the byte buffer, we have to iterate byte by byte through the buffer until we find each `/` character, then build a String for each segment. This will very likely be slower than reading the bytes of the path from the byte buffer in a single operation and then breaking the path String into segments.
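For illustration, the two strategies compared above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `PathSplitSketch`, `splitViaString`, and `splitViaBytes` are hypothetical names, and the splitting logic is a simple stand-in, not the actual `SortKey.genSortKeyPathElements` implementation. It makes no performance claim by itself and only shows the shape of each option.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not code from the PR.
final class PathSplitSketch {

    // Option A: one bulk read of the path bytes, one decode to String,
    // then locate the separators with String.indexOf.
    static List<String> splitViaString(ByteBuffer buffer, int pathLength) {
        byte[] copy = new byte[pathLength];
        buffer.get(copy, 0, pathLength);
        String path = new String(copy, 0, pathLength, StandardCharsets.UTF_8);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (start < path.length()) {
            int slash = path.indexOf('/', start);
            int end = (slash == -1) ? path.length() : slash;
            if (end > start) {
                segments.add(path.substring(start, end));
            }
            start = end + 1;
        }
        return segments;
    }

    // Option B: read the buffer byte by byte, cutting a segment at each '/'.
    // Scanning raw bytes is safe for UTF-8 because the byte 0x2F never
    // occurs inside a multi-byte sequence.
    static List<String> splitViaBytes(ByteBuffer buffer, int pathLength) {
        byte[] segment = new byte[pathLength];
        int segmentLength = 0;
        List<String> segments = new ArrayList<>();
        for (int i = 0; i < pathLength; i++) {
            byte b = buffer.get();
            if (b == '/') {
                if (segmentLength > 0) {
                    segments.add(new String(segment, 0, segmentLength, StandardCharsets.UTF_8));
                    segmentLength = 0;
                }
            } else {
                segment[segmentLength++] = b;
            }
        }
        if (segmentLength > 0) {
            segments.add(new String(segment, 0, segmentLength, StandardCharsets.UTF_8));
        }
        return segments;
    }

    public static void main(String[] args) {
        byte[] pathBytes = "/content/dam/a".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitViaString(ByteBuffer.wrap(pathBytes), pathBytes.length));
        System.out.println(splitViaBytes(ByteBuffer.wrap(pathBytes), pathBytes.length));
    }
}
```

Both methods return the same segments for the same input. Deciding which one is faster on a given JVM would need a proper benchmark harness such as JMH rather than a reading of the code.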
Furthermore, the String with the path is very short-lived, so the JIT is likely applying escape analysis to allocate it on the stack.