Github user countmdm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21456#discussion_r192217237
--- Diff:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
---
@@ -272,6 +273,57 @@ void close() {
}
}
+ /**
+ * This method is needed to avoid the situation when multiple File
instances for the
+ * same pathname "foo/bar" are created, each with a separate copy of the
"foo/bar" String.
+ * According to measurements, in some scenarios such duplicate strings
may waste a lot
+ * of memory (~ 10% of the heap). To avoid that, we intern the pathname,
and before that
+ * we make sure that it's in a normalized form (contains no "//", "///"
etc.) Otherwise,
+ * the internal code in java.io.File would normalize it later, creating
a new "foo/bar"
+ * String copy. Unfortunately, we cannot just reuse the normalization
code that java.io.File
+ * uses, since it is in the package-private class java.io.FileSystem.
+ */
+ private static String createNormalizedInternedPathname(String dir1,
String dir2, String fname) {
+ String pathname = dir1 + File.separator + dir2 + File.separator +
fname;
+ pathname = normalizePathname(pathname);
+ return pathname.intern();
+ }
+
+ @VisibleForTesting
+ static String normalizePathname(String pathname) {
+ int len = pathname.length();
+ char prevChar = 0;
+ for (int i = 0; i < len; i++) {
+ char c = pathname.charAt(i);
+ if (c == '/' && prevChar == '/') {
+ return normalizePathnameRemainder(pathname, i - 1);
+ }
+ prevChar = c;
+ }
+ if (prevChar == '/') return normalizePathnameRemainder(pathname, len -
1);
--- End diff --
Will change the code to put the last 'return' in new line and braces
(though I personally think that the obvious one-line statements like 'return'
and 'continue' can be exempt from this rule).
Regarding replacing this with regex: being a performance-oriented guy, I
know too well that regex code can be notoriously slow. Probably it's because
some general-purpose logic is used almost regardless of the complexity of your
expression. Here it looks like this code may be called quite intensively, so it
seems prudent to use custom-made code (largely copied from java.io.File source)
that's guaranteed to run fast.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]