[GitHub] [jackrabbit-oak] fabriziofortino commented on a diff in pull request #886: OAK-10171: datastore-copy cmd: add checksum validation

via GitHub Mon, 03 Apr 2023 08:47:31 -0700


fabriziofortino commented on code in PR #886:
URL: https://github.com/apache/jackrabbit-oak/pull/886#discussion_r1156144426



##########
oak-run/src/main/java/org/apache/jackrabbit/oak/run/Downloader.java:
##########
@@ -147,14 +163,49 @@ public ItemResponse call() throws Exception {
             sourceUrl.setConnectTimeout(Downloader.this.connectTimeoutMs);
             sourceUrl.setReadTimeout(Downloader.this.readTimeoutMs);
 
+            // Updating a MessageDigest from multiple threads is not thread 
safe, so we cannot reuse a single instance.
+            // Creating a new instance is a lightweight operation, no need to 
increase complexity by creating a pool.
+            MessageDigest md = null;
+            if (Downloader.this.checksumAlgorithm != null && item.checksum != 
null) {
+                md = 
MessageDigest.getInstance(Downloader.this.checksumAlgorithm);
+            }
+
             Path destinationPath = Paths.get(item.destination);
             Files.createDirectories(destinationPath.getParent());
-            long size;
+            long size = 0;
             try (ReadableByteChannel byteChannel = 
Channels.newChannel(sourceUrl.getInputStream());
                  FileOutputStream outputStream = new 
FileOutputStream(destinationPath.toFile())) {
-                size = outputStream.getChannel()
-                        .transferFrom(byteChannel, 0, Long.MAX_VALUE);
+                ByteBuffer buffer = ByteBuffer.allocate(1024);
+                int bytesRead;
+                while ((bytesRead = byteChannel.read(buffer)) != -1) {
+                    buffer.flip();
+                    if (md != null) {
+                        md.update(buffer);
+                    }
+                    size += bytesRead;
+                    outputStream.getChannel().write(buffer);
+                    buffer.clear();
+                }

Review Comment:
   > I'm wondering if the performance of doing the transfer with buffers in the 
Java heap will not be significantly worse than using `transferFrom`, which in 
principle can use low-level mechanisms of the JVM/OS.
   
   Low-level copy mechanisms are possible only for copies between local 
channels. When the source is remote, a similar buffer copy is used:
   
https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/classes/sun/nio/ch/FileChannelImpl.java#L547-L581
   
   > The buffer of 1024 seems small, could hurt performance. I guess the buffer 
here should be large enough to read a large chunk of the TPC receive buffer. 
I'm not sure what is the default size, but it is for sure larger than 1024 
bytes. I found this which suggests that the buffer is close to 80KB. Maybe here 
it is better to increase to 16KB or even 32KB, to reduce the CPU overhead.
   
   It was small indeed. I have executed several runs (see results in the main 
description) and I see some small improvements when the buffer size is larger. 
The optimal buffer size for downloading and checksumming a file depends on a 
variety of factors, such as the size of the file being downloaded, the 
available memory of the system, the network bandwidth, and other 
system-specific factors.
   
   The `buffer-size` is now configurable. The default is 8k which should be a 
safe default buffer size to use. It can be changed if needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [jackrabbit-oak] fabriziofortino commented on a diff in pull request #886: OAK-10171: datastore-copy cmd: add checksum validation

Reply via email to