[
https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ameen Tayyebi updated SPARK-17777:
----------------------------------
Attachment: jstack-dump.txt
> Spark Scheduler Hangs Indefinitely
> ----------------------------------
>
> Key: SPARK-17777
> URL: https://issues.apache.org/jira/browse/SPARK-17777
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
> Reporter: Ameen Tayyebi
> Attachments: jstack-dump.txt, repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself
> when an RDD calls SparkContext.parallelize within its getPartitions method.
> This seemingly "recursive" call causes the scheduler to hang indefinitely.
> We have a repro case that can easily be run.
> Please advise on what the issue might be and how we can work around it in
> the meantime.
> I've attached repro.scala, which can simply be pasted into spark-shell to
> reproduce the problem.
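>
> For reference, the core of the pattern looks roughly like the following
> sketch (a minimal illustration, not the attached repro.scala verbatim; the
> names ReproRDD and SimplePartition are placeholders of mine):
>
> import org.apache.spark.{Partition, SparkContext, TaskContext}
> import org.apache.spark.rdd.RDD
>
> case class SimplePartition(index: Int) extends Partition
>
> // An RDD whose getPartitions itself submits a Spark job via
> // sc.parallelize(...).collect(). The nested submission is what hangs.
> class ReproRDD(sc: SparkContext, paths: Seq[String])
>     extends RDD[String](sc, Nil) {
>
>   override def getPartitions: Array[Partition] = {
>     // Nested job: runs while the scheduler is still resolving this RDD.
>     val sizes = sc.parallelize(paths).map(p => (p, p.length.toLong)).collect()
>     sizes.indices.map(i => SimplePartition(i)).toArray[Partition]
>   }
>
>   override def compute(split: Partition, context: TaskContext): Iterator[String] =
>     Iterator(paths(split.index))
> }
>
> // Any action on the RDD then reproduces the hang, e.g.:
> //   new ReproRDD(sc, Seq("a", "b", "c")).count()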
> Why are we calling sc.parallelize in production within getPartitions? We
> have an RDD that is composed of several thousand Parquet files. To compute
> the partitioning strategy for this RDD, we create an RDD that reads all of
> the file sizes from S3 in parallel, so that we can quickly determine the
> proper partitions. We do this to avoid fetching the sizes serially from the
> master node, which can slow execution down significantly. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths)
>   .map(f => (f, s3.getObjectSummary(f)))
>   .collect()
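>
> Fleshed out slightly, it looks like the following sketch (assuming the AWS
> SDK for Java and that filePaths holds (bucket, key) pairs; the exact client
> calls in our production code may differ):
>
> import com.amazonaws.services.s3.AmazonS3Client
>
> // filePaths: Seq[(String, String)] of (bucket, key) pairs.
> // One client per partition, since AmazonS3Client is not serializable.
> val splitInfo: Array[((String, String), Long)] =
>   sc.parallelize(filePaths)
>     .mapPartitions { iter =>
>       val s3 = new AmazonS3Client() // default credential provider chain
>       iter.map { case (bucket, key) =>
>         ((bucket, key), s3.getObjectMetadata(bucket, key).getContentLength)
>       }
>     }
>     .collect()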
> Similar logic is used by Spark itself in DataFrame:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>
> Thanks,
> -Ameen