Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
@srowen yes, I am pretty sure that this code generates all these duplicate
objects. I've analyzed a heap dump from a real customer, so I cannot publish
the entire jxray report, since it may potentially reveal sensitive data.
But I've just attached a screenshot of the part that covers duplicate strings
to https://issues.apache.org/jira/browse/SPARK-24356 - hopefully it will give
you a better idea of what's going on here. Note that the biggest part of the
overhead (10.5%) comes from FileInputStream.path strings. The FileInputStream
objects themselves are unreachable (but they apparently stay in memory long
enough - if they were GCed immediately, they wouldn't show up in the heap
dump). The sample of paths in FileInputStream.path looks identical to the
sample of File.path values. This, plus looking at the code, makes me think
that the Spark code in question generates a large number of File objects with
a small number of unique paths, and then generates an even larger number of
FileInputStream objects with the same paths.
I don't think this can be optimized anywhere else, short of accessing and
interning the paths directly inside the File objects via Java reflection. As
to why this produces many different copies of identical Strings, consider the
following simple code:
String foo = "foo";
String bar = "bar";
String s1 = foo + "/" + bar;
String s2 = foo + "/" + bar;
System.out.println(s1.equals(s2));
System.out.println(s1 == s2);
This will print
true
false
That is, s1 and s2 have the same contents, but they are two different
objects, taking twice the memory.
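To make the point concrete, here is a minimal standalone sketch (my own demo, not Spark code) showing both the duplication and how String.intern() collapses the copies into one canonical object from the JVM's string pool:

```java
public class InternDemo {
    public static void main(String[] args) {
        String foo = "foo";
        String bar = "bar";

        // Runtime concatenation allocates a fresh String each time.
        String s1 = foo + "/" + bar;
        String s2 = foo + "/" + bar;
        System.out.println(s1.equals(s2)); // true  - same contents
        System.out.println(s1 == s2);      // false - two distinct objects

        // intern() returns the canonical copy from the string pool,
        // so equal interned strings are also the same object.
        String t1 = (foo + "/" + bar).intern();
        String t2 = (foo + "/" + bar).intern();
        System.out.println(t1 == t2);      // true  - one shared object
    }
}
```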
In the Spark code in question we have the same fundamental problem: by
concatenating identical initial strings, it keeps generating multiple copies
of identical "derived" strings. To remove duplicates among these derived
strings, we need to apply String.intern(). However, to make sure that the
code in java.io.File will not subsequently replace our canonicalized strings
if they are not in the normalized form, we need to perform that normalization
ourselves as well, before interning.
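The normalize-then-intern idea can be sketched as follows. All names here are mine, not the actual patch, and the normalize() below is a simplified stand-in (collapsing repeated '/' separators and dropping a trailing one) for java.io.File's real, platform-specific normalization:

```java
import java.io.File;

public class PathInterner {
    // Simplified, Unix-style normalization: collapse duplicate separators
    // and drop a trailing one. java.io.File performs a similar step in its
    // constructor; normalizing first means intern() sees the final form.
    static String normalize(String path) {
        String p = path.replaceAll("/{2,}", "/");
        if (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }

    // Build a File from an interned, pre-normalized path so that many File
    // objects with the same logical path share one String instance.
    static File newInternedFile(String parent, String child) {
        return new File(normalize(parent + "/" + child).intern());
    }

    public static void main(String[] args) {
        File f1 = newInternedFile("/tmp//data", "part-00000");
        File f2 = newInternedFile("/tmp/data/", "part-00000");
        System.out.println(f1.getPath().equals(f2.getPath())); // true
    }
}
```

The key design point is ordering: interning before File's own normalization would be wasted work, because File could then substitute a different (non-interned) String for any path it still considers non-normalized.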