[
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579184#comment-16579184
]
Imran Rashid commented on SPARK-24356:
--------------------------------------
Somewhat related to SPARK-24938, which explains why these buffers are on the
heap at all, given that Spark configures Netty to use off-heap buffers by default.
> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> ------------------------------------------------------------------
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.3.0
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-24356.01.patch, dup-file-strings-details.png
>
>
> I recently analyzed a heap dump of a YARN NodeManager that was suffering from
> high GC pressure due to high object churn. The analysis was done with the JXRay
> tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a
> number of well-known memory issues. One problem it found in this dump is that
> 19.5% of memory is wasted on duplicate strings. Of these duplicates, more
> than half come from {{FileInputStream.path}} and {{File.path}}. All the
> {{FileInputStream}} objects that JXRay shows are garbage; it looks like they
> are used for a very short period and then discarded (I guess there is a
> separate question of whether that's a good pattern). But the {{File}} instances
> are traceable to the
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here
> is the full reference chain:
>
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very
> similar, so I think {{FileInputStream}}s are generated by the
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely
> come from
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>
> To avoid duplicate strings in {{File.path}} in this case, I suggest that the
> above code create a {{File}} from a complete, normalized pathname that has
> already been interned. This will prevent the code inside {{java.io.File}}
> from modifying the string, so it will use the interned copy and pass it on
> to {{FileInputStream}}. Essentially, the current
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)),
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) +
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>
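The suggestion quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the {{fileSystem.normalize}} call in the description is internal to {{java.io}} and not callable from application code, so the sketch substitutes a simple regex that collapses repeated separators and strips a single trailing one. The class name {{InternedPathSketch}} and both method names are hypothetical.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not taken from the Spark code base.
final class InternedPathSketch {
    // Matches two or more consecutive platform separators.
    private static final Pattern REPEATED_SEPARATORS =
        Pattern.compile("(?:" + Pattern.quote(File.separator) + "){2,}");

    // Builds the full pathname once, normalizes it, and returns the interned copy.
    static String normalizedInternedPath(String localDir, int subDirId, String filename) {
        String pathname = localDir + File.separator
            + String.format("%02x", subDirId) + File.separator + filename;
        // Assumption: collapsing repeated separators stands in for the
        // normalization that java.io.File would otherwise apply itself.
        pathname = REPEATED_SEPARATORS.matcher(pathname)
            .replaceAll(Matcher.quoteReplacement(File.separator));
        // A single trailing separator is handled separately.
        if (pathname.length() > 1 && pathname.endsWith(File.separator)) {
            pathname = pathname.substring(0, pathname.length() - 1);
        }
        return pathname.intern();
    }

    // Creates the File from the single, already-normalized, interned string.
    static File getFile(String localDir, int subDirId, String filename) {
        return new File(normalizedInternedPath(localDir, subDirId, filename));
    }
}
```

With this shape, repeated lookups of the same block file yield the same {{String}} instance for the pathname. Whether {{java.io.File}} then keeps that exact instance in its {{path}} field depends on the JDK's own normalization leaving an already-normalized string untouched, as the description assumes.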