[ https://issues.apache.org/jira/browse/SPARK-29017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924975#comment-16924975 ]
Imran Rashid commented on SPARK-29017:
--------------------------------------

[~hyukjin.kwon] [~holdensmagicalunicorn] [~bryanc] any thoughts on how to fix this issue in py4j? [~felixcheung] I don't think this issue exists in SparkR, because I didn't see anything like {{setJobGroup}} or {{setLocalProperty}}, but maybe I'm not looking in the right place.

> JobGroup and LocalProperty not respected by PySpark
> ---------------------------------------------------
>
>                 Key: SPARK-29017
>                 URL: https://issues.apache.org/jira/browse/SPARK-29017
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.4
>            Reporter: Imran Rashid
>            Priority: Major
>
> PySpark has {{setJobGroup}} and {{setLocalProperty}} methods, which are
> intended to set properties that affect only the calling thread. They try to
> do this by calling the equivalent JVM methods via Py4J.
> However, nothing ensures that subsequent Py4J calls from a given Python
> thread are serviced by the same thread in the JVM. In effect, this means
> these methods may appear to work some of the time, if you happen to get
> lucky and land on the same thread on the Java side. But sometimes they won't
> work, and in fact they are less likely to work when multiple Python threads
> are submitting jobs.
> I think the right way to fix this is to keep a *Python* thread-local tracking
> these properties, and then send them through to the JVM on calls to
> submitJob. This is going to be a headache to get right, though; we've also
> got to handle implicit calls, e.g. {{rdd.collect()}}, {{rdd.foreach()}}, etc.
> And of course users may have defined their own functions, which will be
> broken until they fix them to use the same thread-locals.
> An alternative might be to use what Py4J calls the "Single Threading Model"
> (https://www.py4j.org/advanced_topics.html#the-single-threading-model). I'd
> want to look more closely at the Py4J implementation of how that works first.
> I can't think of any guaranteed workaround, but I think you could increase
> your chances of getting the desired behavior if you always set those
> properties just before each call to runJob. E.g., instead of
> {code:python}
> # more likely to trigger the bug this way
> sc.setJobGroup("a", "demo")
> rdd1.collect()  # or whatever other way you submit a job
> rdd2.collect()
> # lots more stuff ...
> rddN.collect()
> {code}
> change it to
> {code:python}
> # slightly safer, but still no guarantees
> sc.setJobGroup("a", "demo")
> rdd1.collect()  # or whatever other way you submit a job
> sc.setJobGroup("a", "demo")
> rdd2.collect()
> # lots more stuff ...
> sc.setJobGroup("a", "demo")
> rddN.collect()
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
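The Python-side thread-local approach proposed in the description can be sketched roughly as follows. This is a minimal illustration of the idea, not Spark's actual implementation; the names {{set_job_group}}, {{get_job_group}}, and {{submit_job}} are hypothetical stand-ins for the real PySpark/Py4J plumbing:

```python
import threading

# Hypothetical sketch: track job-group (and other "local") properties
# in a *Python* thread-local, and forward them explicitly at job
# submission time, so it no longer matters which JVM thread happens
# to service the Py4J call.
_props = threading.local()

def set_job_group(group_id):
    # Stored per Python thread, never sent to the JVM eagerly.
    _props.job_group = group_id

def get_job_group():
    return getattr(_props, "job_group", None)

def submit_job(job):
    # At submission time, read the Python-side thread-local and ship
    # the properties along with the job itself (in real PySpark this
    # would accompany the submitJob call into the JVM).
    return (get_job_group(), job)
```

Because the properties travel with each submission rather than living in JVM thread-locals, two Python threads submitting jobs concurrently keep their groups isolated regardless of Py4J's thread mapping.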