Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
@srowen yes, I am pretty sure that this code generates all these duplicate
objects. I've analyzed a heap dump from a real customer, so I cannot publish
the entire jxray report, since it may potentially reveal sensitive data.
But I've just attached a screenshot of the part that covers duplicate strings
to https://issues.apache.org/jira/browse/SPARK-24356 - hopefully it will give
you a better idea of what's going on here. Note that the biggest part of the
overhead (10.5%) comes from FileInputStream.path strings. The FileInputStream
objects themselves are unreachable (but they apparently stay in memory long
enough - if they were GCed immediately, they wouldn't show up in the heap
dump). The sample of paths in FileInputStream.path looks identical to the
sample of File.path values. This, plus looking at the code, makes me think
that the Spark code in question generates a large number of File objects with
a small number of unique paths, and then generates an even larger number of
FileInputStream objects with the same paths.
I don't think this can be optimized anywhere else, short of accessing and
interning the paths directly inside the File objects via Java reflection. As
to why this produces many different copies of identical Strings, consider the
following simple code:
String foo = "foo";
String bar = "bar";
String s1 = foo + "/" + bar;
String s2 = foo + "/" + bar;
System.out.println(s1.equals(s2));
System.out.println(s1 == s2);
This will print
true
false
That is, s1 and s2 have the same contents, but they are two different
objects, taking twice the memory.
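To make the point concrete, here is a minimal standalone sketch (my own demo, not Spark code) showing both the duplication and how String.intern() collapses the copies into one canonical object from the JVM's string pool:

```java
public class InternDemo {
    public static void main(String[] args) {
        String foo = "foo";
        String bar = "bar";

        // Runtime concatenation allocates a fresh String each time.
        String s1 = foo + "/" + bar;
        String s2 = foo + "/" + bar;
        System.out.println(s1.equals(s2)); // true  - same contents
        System.out.println(s1 == s2);      // false - two distinct objects

        // intern() returns the canonical copy from the string pool,
        // so equal interned strings are also the same object.
        String t1 = (foo + "/" + bar).intern();
        String t2 = (foo + "/" + bar).intern();
        System.out.println(t1 == t2);      // true  - one shared object
    }
}
```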
In the Spark code in question we have the same fundamental problem: by
concatenating identical initial strings, it keeps generating multiple copies
of identical "derived" strings. To remove duplicates among these derived
strings, we need to apply String.intern(). However, to make sure that the
code in java.io.File will not subsequently replace our canonicalized strings
if they are not in the normalized form, we need to perform that normalization
ourselves as well, before interning.
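The normalize-then-intern idea can be sketched as follows. All names here are mine, not the actual patch, and the normalize() below is a simplified stand-in (collapsing repeated '/' separators and dropping a trailing one) for java.io.File's real, platform-specific normalization:

```java
import java.io.File;

public class PathInterner {
    // Simplified, Unix-style normalization: collapse duplicate separators
    // and drop a trailing one. java.io.File performs a similar step in its
    // constructor; normalizing first means intern() sees the final form.
    static String normalize(String path) {
        String p = path.replaceAll("/{2,}", "/");
        if (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }

    // Build a File from an interned, pre-normalized path so that many File
    // objects with the same logical path share one String instance.
    static File newInternedFile(String parent, String child) {
        return new File(normalize(parent + "/" + child).intern());
    }

    public static void main(String[] args) {
        File f1 = newInternedFile("/tmp//data", "part-00000");
        File f2 = newInternedFile("/tmp/data/", "part-00000");
        System.out.println(f1.getPath().equals(f2.getPath())); // true
    }
}
```

The key design point is ordering: interning before File's own normalization would be wasted work, because File could then substitute a different (non-interned) String for any path it still considers non-normalized.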