attilapiros commented on a change in pull request #28967:
URL: https://github.com/apache/spark/pull/28967#discussion_r449055382
##########
File path:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java
##########
@@ -45,34 +31,16 @@ public static File getFile(String[] localDirs, int
subDirsPerLocalDir, String fi
int hash = JavaUtils.nonNegativeHash(filename);
String localDir = localDirs[hash % localDirs.length];
int subDirId = (hash / localDirs.length) % subDirsPerLocalDir;
- return new File(createNormalizedInternedPathname(
- localDir, String.format("%02x", subDirId), filename));
- }
-
- /**
- * This method is needed to avoid the situation when multiple File instances
for the
- * same pathname "foo/bar" are created, each with a separate copy of the
"foo/bar" String.
- * According to measurements, in some scenarios such duplicate strings may
waste a lot
- * of memory (~ 10% of the heap). To avoid that, we intern the pathname, and
before that
- * we make sure that it's in a normalized form (contains no "//", "///"
etc.) Otherwise,
- * the internal code in java.io.File would normalize it later, creating a
new "foo/bar"
- * String copy. Unfortunately, we cannot just reuse the normalization code
that java.io.File
- * uses, since it is in the package-private class java.io.FileSystem.
- *
- * On Windows, separator "\" is used instead of "/".
- *
- * "\\" is a legal character in path name on Unix-like OS, but illegal on
Windows.
- */
- @VisibleForTesting
- static String createNormalizedInternedPathname(String dir1, String dir2,
String fname) {
- String pathname = dir1 + File.separator + dir2 + File.separator + fname;
- Matcher m = MULTIPLE_SEPARATORS.matcher(pathname);
- pathname = m.replaceAll(Matcher.quoteReplacement(File.separator));
- // A single trailing slash needs to be taken care of separately
- if (pathname.length() > 1 && pathname.charAt(pathname.length() - 1) ==
File.separatorChar) {
- pathname = pathname.substring(0, pathname.length() - 1);
- }
- return pathname.intern();
+ final String notNormalizedPath =
+ localDir + File.separator + String.format("%02x", subDirId) +
File.separator + filename;
+ // Interning the normalized path as according to measurements, in some
scenarios such
+ // duplicate strings may waste a lot of memory (~ 10% of the heap).
+ // Unfortunately, we cannot just call the normalization code that
java.io.File
+ // uses, since it is in the package-private class java.io.FileSystem.
+ // So we are creating a File just to get the normalized path back to
intern it.
+ // Finally a new File is built and returned with this interned normalized
path.
+ final String normalizedInternedPath = new
File(notNormalizedPath).getPath().intern();
Review comment:
No, it does not.
As the first `File` instance is on the heap and will be garbage collected as
was this string in the old solution if the replace take has some effect:
https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java#L68
Or even the replace result string when it ended with the `separatorChar `
and transformed further:
https://github.com/apache/spark/blob/7fda184f0fc39613fb68e912c189c54b93c638e6/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java#L70
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]