[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856922#comment-15856922 ]
Peter D Kirchner commented on SPARK-19348:
------------------------------------------

Two things happen with this wrapper.

First, it appears to me (after confirming with some simplified examples) that modifications made to incoming arguments in the body of a wrapped function are lost (in the pipeline.py example, `stages`), because those modifications are not reflected in the saved kwargs dictionary that is passed along in the calls made from inside the wrapped function. The decorator explicitly states that it saves the "actual input arguments" but does not clarify why. It seems that either the wrapped code should update the dictionary, or the lines of code that have no lasting effect should be deleted.

Second, bryanc has pointed out that passing the kwargs to the wrapped function via a static class variable is thread-unsafe within each of the many ml classes that use the decorator. Passing the kwargs as an instance variable, as bryanc has proposed, seems a satisfactory fix for the second problem, as would using a thread-local class variable; both require changes to every file that uses the decorator. Locking in the decorator, if it could be implemented in spite of the nested calls of decorated functions, would confine the changes to the decorator definition. Passing the wrapper's kwargs dictionary as an additional entry in the kwargs dictionary passed to the wrapped function would be thread-safe, but would change the public API of many functions. It also looks possible for a decorator to introspect the wrapped function's parameters, in which case the wrapper could pass the dictionary on one of those keywords (the purpose being to leave the public API intact); the wrapped function would then need code to detect the wrapper, retrieve the dictionary, and restore the co-opted variable. (As-is, the wrapped functions already contain wrapper-specific code in order to function.)
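For illustration, here is a minimal sketch (not Spark's actual code) of the thread-local variant discussed above: the decorator captures the caller-supplied kwargs in `threading.local()` storage rather than a shared class attribute, so concurrent constructions in separate threads cannot clobber each other. The `keyword_only` name mirrors pyspark.ml's decorator, but the `_local.input_kwargs` slot and the stub `Pipeline` class here are assumptions for the sketch, not the library's API.

```python
import functools
import threading

# Thread-local slot replacing the thread-unsafe static class variable.
_local = threading.local()

def keyword_only(func):
    """Capture the caller's keyword arguments in thread-local storage
    before invoking the wrapped function (sketch, not Spark's code)."""
    @functools.wraps(func)
    def wrapper(self, **kwargs):
        _local.input_kwargs = kwargs  # per-thread, so no cross-thread clobbering
        return func(self, **kwargs)
    return wrapper

class Pipeline:
    """Stub standing in for pyspark.ml.Pipeline to show the pattern."""
    @keyword_only
    def __init__(self, stages=None):
        kwargs = _local.input_kwargs  # the "actual input arguments"
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, stages=None):
        kwargs = _local.input_kwargs
        self.stages = kwargs.get("stages", [])
```

Note that the nested decorated call (`__init__` invoking `setParams`) still works: each wrapped call reads its own kwargs out of the slot before the next wrapped call overwrites it. Constructing `Pipeline` objects concurrently from many threads then yields each object its own stages, rather than the cross-contamination reported in this issue.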
> pyspark.ml.Pipeline gets corrupted under multi threaded use
> -----------------------------------------------------------
>
>                 Key: SPARK-19348
>                 URL: https://issues.apache.org/jira/browse/SPARK-19348
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0
>            Reporter: Vinayak Joshi
>         Attachments: pyspark_pipeline_threads.py
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate python threads, it is observed that the stages used to construct a pipeline object get corrupted, i.e. the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in a different thread.
> Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread safety problem with pyspark.ml.Pipeline object construction.
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know if other pipeline operations, such as pyspark.ml.pipeline.fit(), are also affected by the underlying cause of this problem. That is, whether other pipeline operations like pyspark.ml.pipeline.fit() may be performed in separate threads (on distinct pipeline objects) concurrently without any cross contamination between them.