[
https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ian updated SPARK-13872:
------------------------
Description:
SortMergeJoin composes its partition/iterator from
org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to
UnsafeExternalRowSorter.
UnsafeExternalRowSorter's implementation cleans up its resources when:
1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully
iterated.
2. the task finishes execution.
In the outer join case of SortMergeJoin, when the left or right iterator is not
fully iterated, the only chance for the resources to be cleaned up is at the
end of the Spark task run. This is probably OK most of the time; however, when a
SortMergeOuterJoin is nested within a CartesianProduct, the "deferred"
resource cleanup becomes a non-negligible memory leak, amplified by
CartesianRDD's repeated iteration.
was:
SortMergeJoin composes its partition/iterator from
org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to
UnsafeExternalRowSorter.
UnsafeExternalRowSorter's implementation cleans up its resources when:
1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully
iterated.
2. the task finishes execution.
In the outer join case of SortMergeJoin, when the left or right iterator is not
fully iterated, the only chance for the resources to be cleaned up is at the
end of the Spark task run. This is probably OK most of the time; however, when a
SortMergeOuterJoin is nested within a CartesianProduct, the "deferred"
resource cleanup becomes a non-negligible memory leak, amplified by
CartesianRDD's outer loop iteration.
> Memory leak in SortMergeOuterJoin
> ---------------------------------
>
> Key: SPARK-13872
> URL: https://issues.apache.org/jira/browse/SPARK-13872
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Ian
> Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png
>
>
> SortMergeJoin composes its partition/iterator from
> org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to
> UnsafeExternalRowSorter.
> UnsafeExternalRowSorter's implementation cleans up its resources when:
> 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully
> iterated.
> 2. the task finishes execution.
> In the outer join case of SortMergeJoin, when the left or right iterator is not
> fully iterated, the only chance for the resources to be cleaned up is at the
> end of the Spark task run. This is probably OK most of the time; however, when a
> SortMergeOuterJoin is nested within a CartesianProduct, the "deferred"
> resource cleanup becomes a non-negligible memory leak, amplified by
> CartesianRDD's repeated iteration.
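
The two cleanup paths described above, and how a nested loop amplifies the
deferred one, can be illustrated with a small simulation. This is only a
sketch in plain Python, not Spark code: `Task`, `SortedRunIterator`, and
`run_nested` are hypothetical names invented for the illustration, and
"memory" is just a counter of buffered rows.

```python
class Task:
    """Hypothetical stand-in for a running task that tracks held memory."""

    def __init__(self):
        self.held = 0          # rows currently buffered by live sorters
        self.peak = 0          # highest value `held` ever reached
        self._listeners = []   # callbacks to run when the task completes

    def acquire(self, n):
        self.held += n
        self.peak = max(self.peak, self.held)

    def release(self, n):
        self.held -= n

    def on_completion(self, callback):
        self._listeners.append(callback)

    def complete(self):
        for callback in self._listeners:
            callback()


class SortedRunIterator:
    """Hypothetical sorter-backed iterator that holds a buffer until cleanup."""

    def __init__(self, rows, task):
        self._rows = sorted(rows)
        self._size = len(self._rows)
        self._pos = 0
        self._task = task
        self._released = False
        task.acquire(self._size)
        task.on_completion(self.cleanup)   # cleanup path 2: task completion

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= self._size:
            self.cleanup()                 # cleanup path 1: fully iterated
            raise StopIteration
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def cleanup(self):
        if not self._released:             # idempotent: release only once
            self._released = True
            self._task.release(self._size)


def run_nested(outer_count, inner_rows, consume):
    """Build a fresh sorted iterator per outer row and call next() `consume` times.

    Returns (peak memory during the task, memory held after task completion).
    """
    task = Task()
    for _ in range(outer_count):
        it = SortedRunIterator(inner_rows, task)
        for _ in range(consume):
            next(it, None)
    peak = task.peak
    task.complete()
    return peak, task.held


# Partially consumed: buffers pile up until the task ends.
print(run_nested(4, [3, 1, 2], consume=2))   # (12, 0)
# Fully drained: each buffer is released before the next one is built.
print(run_nested(4, [3, 1, 2], consume=4))   # (3, 0)
```

In the partially consumed case the peak grows linearly with the number of
outer iterations, which is the amplification effect reported here; in the
drained case it stays at the size of a single buffer.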