[GitHub] [jackrabbit-oak] nfsantos commented on a diff in pull request #886: OAK-10171: datastore-copy cmd: add checksum validation

via GitHub Fri, 31 Mar 2023 08:53:31 -0700


nfsantos commented on code in PR #886:
URL: https://github.com/apache/jackrabbit-oak/pull/886#discussion_r1154621454



##########
oak-run/src/main/java/org/apache/jackrabbit/oak/run/Downloader.java:
##########
@@ -147,14 +163,49 @@ public ItemResponse call() throws Exception {
             sourceUrl.setConnectTimeout(Downloader.this.connectTimeoutMs);
             sourceUrl.setReadTimeout(Downloader.this.readTimeoutMs);
 
+            // Updating a MessageDigest from multiple threads is not thread 
safe, so we cannot reuse a single instance.
+            // Creating a new instance is a lightweight operation, no need to 
increase complexity by creating a pool.
+            MessageDigest md = null;
+            if (Downloader.this.checksumAlgorithm != null && item.checksum != 
null) {
+                md = 
MessageDigest.getInstance(Downloader.this.checksumAlgorithm);
+            }
+
             Path destinationPath = Paths.get(item.destination);
             Files.createDirectories(destinationPath.getParent());
-            long size;
+            long size = 0;
             try (ReadableByteChannel byteChannel = 
Channels.newChannel(sourceUrl.getInputStream());
                  FileOutputStream outputStream = new 
FileOutputStream(destinationPath.toFile())) {
-                size = outputStream.getChannel()
-                        .transferFrom(byteChannel, 0, Long.MAX_VALUE);
+                ByteBuffer buffer = ByteBuffer.allocate(1024);
+                int bytesRead;
+                while ((bytesRead = byteChannel.read(buffer)) != -1) {
+                    buffer.flip();
+                    if (md != null) {
+                        md.update(buffer);
+                    }
+                    size += bytesRead;
+                    outputStream.getChannel().write(buffer);
+                    buffer.clear();
+                }

Review Comment:
   I'm wondering if the performance of doing the transfer with buffers in the 
Java heap will not be significantly worse than using `transferFrom`, which in 
principle can use low-level mechanisms of the JVM/OS. In ideal conditions, the 
transferFrom could even be zero copy. Maybe it would be faster to still use 
`transferFrom` and then compute the checksum after the file is written? Or 
maybe the transferFrom is not much better than copying in the heap or we are 
limited by the network transfer and it does not really matter.
   
   The buffer of 1024 seems small, could hurt performance. I guess the buffer 
here should be large enough to read a large chunk of the TPC receive buffer. 
I'm not sure what is the default size, but it is for sure larger than 1024 
bytes. I found this which suggests that the buffer is close to 80KB. Maybe here 
it is better to increase to 16KB or even 32KB, to reduce the CPU overhead.
                
   https://man7.org/linux/man-pages/man7/tcp.7.html
   > tcp_rmem (since Linux 2.4)
   >  default
   >                      the default size of the receive buffer for a TCP
   >                      socket.  This value overwrites the initial default
   >                      buffer size from the generic global
   >                      net.core.rmem_default defined for all protocols.
   >                      The default value is 87380 bytes.  (On Linux 2.4,
   >                      this will be lowered to 43689 in low-memory
   >                      systems.)  If larger receive buffer sizes are
   >                      desired, this value should be increased (to affect
   >                      all sockets).  To employ large TCP windows, the
   >                      net.ipv4.tcp_window_scaling must be enabled
   >                      (default).
   



##########
oak-run/src/main/java/org/apache/jackrabbit/oak/run/Downloader.java:
##########
@@ -147,14 +163,49 @@ public ItemResponse call() throws Exception {
             sourceUrl.setConnectTimeout(Downloader.this.connectTimeoutMs);
             sourceUrl.setReadTimeout(Downloader.this.readTimeoutMs);
 
+            // Updating a MessageDigest from multiple threads is not thread 
safe, so we cannot reuse a single instance.
+            // Creating a new instance is a lightweight operation, no need to 
increase complexity by creating a pool.
+            MessageDigest md = null;
+            if (Downloader.this.checksumAlgorithm != null && item.checksum != 
null) {
+                md = 
MessageDigest.getInstance(Downloader.this.checksumAlgorithm);
+            }
+
             Path destinationPath = Paths.get(item.destination);
             Files.createDirectories(destinationPath.getParent());
-            long size;
+            long size = 0;
             try (ReadableByteChannel byteChannel = 
Channels.newChannel(sourceUrl.getInputStream());
                  FileOutputStream outputStream = new 
FileOutputStream(destinationPath.toFile())) {

Review Comment:
   Build a FileChannel directly instead of getting the file channel from an 
output stream. Something like this (not tested):
   ```
   SeekableByteChannel outputStream = Files.newByteChannel(destinationPath, 
StandardOpenOption.WRITE, StandardOpenOption.CREATE_NEW)
   ```
   
   Or else just use streams so we do not mix channels and streams.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [jackrabbit-oak] nfsantos commented on a diff in pull request #886: OAK-10171: datastore-copy cmd: add checksum validation

Reply via email to