sodonnel commented on a change in pull request #2838:
URL: https://github.com/apache/hadoop/pull/2838#discussion_r604770420
##########
File path:
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
##########
@@ -1345,8 +1363,18 @@ public boolean accept(File dir, String name) {
throw new IOException("Failed to mkdirs " + blockLocation);
}
}
- idBasedLayoutSingleLinks.add(new LinkArgs(new File(from, blockName),
- new File(blockLocation, blockName)));
+ /**
+ * The destination path is 32x32, so 1024 distinct paths. Therefore
+ * we cache the destination path and reuse the same File object on
+ * potentially thousands of blocks located on this volume.
+ * This method is called recursively so the cache is passed through
+ * each recursive call. There is one cache per volume, and it is only
+ * accessed by a single thread so no locking is needed.
+ */
+ File cachedDest = pathCache
+ .computeIfAbsent(blockLocation, k -> blockLocation);
+ idBasedLayoutSingleLinks.add(new LinkArgs(from,
Review comment:
Yes. This is the part of the change that really reduces memory usage. As
the src blocks are checked, the code computes destination paths like:
```
/dest/subdir1/subdir10 ***
/dest/subdir12/subdir1
/dest/subdir3/subdir1
/dest/subdir1/subdir10 *** These two are the same, but unless we cache the
instance we cannot reuse the object
```
There are only 1024 unique paths, but if we just use the calculated path
each time, we will have thousands of File objects each with the same path
stored in them.
We still create a temporary File object with the destination, but then find
the unique instance for it in the hashmap and store that in the LinkArgs. The
temporary one is then GC'ed, and we end up with at most 1024 unique File
objects for dst, saving a lot of memory.
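A minimal standalone sketch of that lookup, for illustration only. The class
name `PathCacheSketch` and the helper `intern` are hypothetical and are not
part of DataStorage; only the `computeIfAbsent` pattern mirrors the diff
above. It relies on `File.equals` and `File.hashCode` being based on the
pathname, so two File objects built from the same path hit the same map
entry.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not the real DataStorage code.
final class PathCacheSketch {

  // Return the one cached File instance equal to 'candidate', adding
  // 'candidate' itself to the cache if no equal path has been seen yet.
  static File intern(Map<File, File> pathCache, File candidate) {
    return pathCache.computeIfAbsent(candidate, k -> k);
  }

  public static void main(String[] args) {
    // One cache per volume, used by a single thread, so a plain HashMap.
    Map<File, File> pathCache = new HashMap<>();

    String[] computed = {
        "/dest/subdir1/subdir10",
        "/dest/subdir12/subdir1",
        "/dest/subdir3/subdir1",
        "/dest/subdir1/subdir10"   // same path as the first entry
    };

    List<File> stored = new ArrayList<>();
    for (String path : computed) {
      File temp = new File(path);            // temporary, may be discarded
      stored.add(intern(pathCache, temp));   // keep only the shared instance
    }

    System.out.println(pathCache.size());                 // 3
    System.out.println(stored.get(0) == stored.get(3));   // true
  }
}
```

The fourth lookup returns the instance created for the first one, so the
temporary File for the repeated path is never retained.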
The same holds for the src path, but due to the way the method works, we
don't need to cache it. The method finds all the blocks in the src path
during the same call, so we just store the passed-in src path each time.