Github user countmdm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21456#discussion_r192217237
  
    --- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
 ---
    @@ -272,6 +273,57 @@ void close() {
         }
       }
     
    +  /**
    +   * This method is needed to avoid the situation when multiple File 
instances for the
    +   * same pathname "foo/bar" are created, each with a separate copy of the 
"foo/bar" String.
    +   * According to measurements, in some scenarios such duplicate strings 
may waste a lot
    +   * of memory (~ 10% of the heap). To avoid that, we intern the pathname, 
and before that
    +   * we make sure that it's in a normalized form (contains no "//", "///" 
etc.) Otherwise,
    +   * the internal code in java.io.File would normalize it later, creating 
a new "foo/bar"
    +   * String copy. Unfortunately, we cannot just reuse the normalization 
code that java.io.File
    +   * uses, since it is in the package-private class java.io.FileSystem.
    +   */
    +  private static String createNormalizedInternedPathname(String dir1, 
String dir2, String fname) {
    +    String pathname = dir1 + File.separator + dir2 + File.separator + 
fname;
    +    pathname = normalizePathname(pathname);
    +    return pathname.intern();
    +  }
    +
    +  @VisibleForTesting
    +  static String normalizePathname(String pathname) {
    +    int len = pathname.length();
    +    char prevChar = 0;
    +    for (int i = 0; i < len; i++) {
    +      char c = pathname.charAt(i);
    +      if (c == '/' && prevChar == '/') {
    +        return normalizePathnameRemainder(pathname, i - 1);
    +      }
    +      prevChar = c;
    +    }
    +    if (prevChar == '/') return normalizePathnameRemainder(pathname, len - 
1);
    --- End diff --
    
    Will change the code to put the last 'return' in new line and braces 
(though I personally think that the obvious one-line statements like 'return' 
and 'continue' can be exempt from this rule).
    
    Regarding replacing this with regex: being a performance-oriented guy, I 
know too well that regex code can be notoriously slow. Probably it's because 
some general-purpose logic is used almost regardless of the complexity of your 
expression. Here it looks like this code may be called quite intensively, so it 
seems prudent to use custom-made code (largely copied from java.io.File source) 
that's guaranteed to run fast.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to