[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856922#comment-15856922 ]
Peter D Kirchner commented on SPARK-19348:
------------------------------------------

Two things happen with this wrapper.

First, it appears to me (after confirming with some simplified examples) that modifications made to incoming arguments in the body of a wrapped function are lost (in the pipeline.py example, `stages`), because those modifications are not reflected in the saved kwargs dictionary that is passed along in the calls made from inside the wrapped function. The decorator explicitly states that it saves the "actual input arguments" but does not clarify why. It seems that either the wrapped code should update the dictionary, or the lines of code that have no lasting effect should be deleted.

Second, bryanc has pointed out that passing the kwargs to the wrapped function via a static class variable is thread-unsafe within each of the many ml classes that use the decorator. Passing the kwargs as an instance variable, as bryanc has proposed, seems a satisfactory fix for the second problem, as would using a thread-local class variable; both require changes to every file that uses the decorator. Locking in the decorator, if it could be implemented in spite of the nested calls of decorated functions, would confine the changes to the decorator definition. Passing the wrapper's kwargs dictionary as an additional entry in the kwargs dictionary passed to the wrapped function would be thread-safe, but would change the public API of many functions. It also looks possible for a decorator to introspect the wrapped function's parameters, in which case the wrapper could pass the dictionary on one of those keywords (the purpose being to leave the public API intact); the wrapped function would then need code to detect the wrapper, retrieve the dictionary, and restore the co-opted variable. (As-is, the wrapped functions already contain wrapper-specific code in order to function.)
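For illustration, here is a minimal sketch (not Spark's actual code) of the thread-local variant discussed above: the decorator captures the caller-supplied kwargs in `threading.local()` storage rather than a shared class attribute, so concurrent constructions in separate threads cannot clobber each other. The `keyword_only` name mirrors pyspark.ml's decorator, but the `_local.input_kwargs` slot and the stub `Pipeline` class here are assumptions for the sketch, not the library's API.

```python
import functools
import threading

# Thread-local slot replacing the thread-unsafe static class variable.
_local = threading.local()

def keyword_only(func):
    """Capture the caller's keyword arguments in thread-local storage
    before invoking the wrapped function (sketch, not Spark's code)."""
    @functools.wraps(func)
    def wrapper(self, **kwargs):
        _local.input_kwargs = kwargs  # per-thread, so no cross-thread clobbering
        return func(self, **kwargs)
    return wrapper

class Pipeline:
    """Stub standing in for pyspark.ml.Pipeline to show the pattern."""
    @keyword_only
    def __init__(self, stages=None):
        kwargs = _local.input_kwargs  # the "actual input arguments"
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, stages=None):
        kwargs = _local.input_kwargs
        self.stages = kwargs.get("stages", [])
```

Note that the nested decorated call (`__init__` invoking `setParams`) still works: each wrapped call reads its own kwargs out of the slot before the next wrapped call overwrites it. Constructing `Pipeline` objects concurrently from many threads then yields each object its own stages, rather than the cross-contamination reported in this issue.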
> pyspark.ml.Pipeline gets corrupted under multi threaded use
> -----------------------------------------------------------
>
>                 Key: SPARK-19348
>                 URL: https://issues.apache.org/jira/browse/SPARK-19348
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0
>            Reporter: Vinayak Joshi
>         Attachments: pyspark_pipeline_threads.py
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate python threads, it is observed that the stages used to construct a pipeline object get corrupted, i.e. the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in a different thread.
> Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread safety problem with pyspark.ml.Pipeline object construction.
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know if other pipeline operations, such as pyspark.ml.pipeline.fit(), are also affected by the underlying cause of this problem. That is, whether other pipeline operations like pyspark.ml.pipeline.fit() may be performed in separate threads (on distinct pipeline objects) concurrently without any cross contamination between them.