[
https://issues.apache.org/jira/browse/SPARK-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259745#comment-14259745
]
Josh Rosen commented on SPARK-4778:
-----------------------------------
Are there any errors in the executor log where that hanging {{take}} task runs?
If you don't have executor logs or a reproduction, I'm inclined to close this
JIRA as "Cannot reproduce."
> PySpark Json and groupByKey broken
> ----------------------------------
>
> Key: SPARK-4778
> URL: https://issues.apache.org/jira/browse/SPARK-4778
> Project: Spark
> Issue Type: Bug
> Components: EC2, PySpark
> Affects Versions: 1.1.1
> Environment: ec2 cluster launched from ec2 script
> pyspark
> c3.2xlarge 6 nodes
> hadoop major version 1
> Reporter: Brad Willard
>
> When I run a groupByKey, it seems to create a single task after the
> groupByKey that never stops executing. I'm loading a smallish JSON dataset
> of about 4 million records. This is the code I'm running.
> rdd = sql_context.jsonFile(hdfs_uri)
> rdd = rdd.cache()
> grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)
> grouped.take(1)
> The groupByKey stage takes a few minutes, which I'd expect. However, the take
> operation never completes. It just hangs indefinitely.
> This is what it looks like in UI
> http://cl.ly/image/2k1t3I253T0x
> The only workaround I have at the moment is to run a map operation after
> loading from JSON to convert all the Row objects to Python dictionary objects.
> Things work after that, although the map operation is expensive.
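The workaround described above can be sketched without a cluster. This is a minimal plain-Python simulation (no PySpark): a `namedtuple` stands in for a SQL `Row`, a dict-based grouping stands in for `groupByKey`, and all names and data are illustrative, not the reporter's actual code. In real PySpark the conversion step would be `row.asDict()`.

```python
from collections import namedtuple, defaultdict

# Stand-in for a PySpark SQL Row (illustrative only).
Row = namedtuple("Row", ["id", "value"])

records = [Row(1, "a"), Row(2, "b"), Row(1, "c")]

# The reporter's workaround: convert each Row to a plain dict
# before grouping (in PySpark this would be row.asDict()).
dicts = [r._asdict() for r in records]

# Local analogue of rdd.map(lambda d: (d["id"], d)).groupByKey()
grouped = defaultdict(list)
for d in dicts:
    grouped[d["id"]].append(d)
```

The point of the conversion is that plain dicts serialize straightforwardly, whereas the hang is reported only when the grouped values are the original `Row` objects.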
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)