[
https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-20765:
---------------------------------
Labels: bulk-closed (was: )
> Cannot load persisted PySpark ML Pipeline that includes a 3rd party stage
> (Transformer or Estimator) if the package name of the stage is neither
> "org.apache.spark" nor "pyspark"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-20765
> URL: https://issues.apache.org/jira/browse/SPARK-20765
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0, 2.1.0, 2.2.0
> Reporter: APeng Zhang
> Priority: Major
> Labels: bulk-closed
>
> When loading a persisted PySpark ML Pipeline instance, Pipeline._from_java()
> invokes JavaParams._from_java() to create the Python instance of each persisted
> stage. In JavaParams._from_java(), the Python class name is derived from the
> Java class name by replacing the string "org.apache.spark" with "pyspark".
> This works for the Transformers and Estimators shipped with PySpark, but for a
> 3rd party Transformer or Estimator whose package name is neither
> "org.apache.spark" nor "pyspark", loading fails with an error:
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 228, in load
>     return cls.read().load(path)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 180, in load
>     return self._clazz._from_java(java_obj)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/pipeline.py", line 160, in _from_java
>     py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 169, in _from_java
>     py_type = __get_class(stage_name)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 163, in __get_class
>     m = __import__(module)
> ImportError: No module named com.abc.xyz.ml.testclass
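
The failure can be reproduced without Spark. The sketch below mirrors the single string replacement performed in JavaParams._from_java(); the helper name java_to_python_class is hypothetical, and the 3rd party class name com.abc.xyz.ml.testclass is taken from the traceback above.

```python
def java_to_python_class(java_class_name):
    # Mirrors JavaParams._from_java(): only the "org.apache.spark" prefix
    # is rewritten to "pyspark"; any other package name passes through.
    return java_class_name.replace("org.apache.spark", "pyspark")

# Built-in stage: translates to a real PySpark module.
java_to_python_class("org.apache.spark.ml.feature.Binarizer")
# -> "pyspark.ml.feature.Binarizer"

# Third-party stage: the JVM package name survives unchanged, so the
# subsequent __import__("com.abc.xyz.ml") raises ImportError because no
# such Python module exists.
java_to_python_class("com.abc.xyz.ml.testclass")
# -> "com.abc.xyz.ml.testclass"
```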
> Related code in PySpark:
> In pyspark/ml/pipeline.py:
> class Pipeline(Estimator, MLReadable, MLWritable):
>     @classmethod
>     def _from_java(cls, java_stage):
>         # Create a new instance of this stage.
>         py_stage = cls()
>         # Load information from java_stage to the instance.
>         py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
> In pyspark/ml/wrapper.py:
> class JavaParams(JavaWrapper, Params):
>     @staticmethod
>     def _from_java(java_stage):
>         def __get_class(clazz):
>             """
>             Loads Python class from its name.
>             """
>             parts = clazz.split('.')
>             module = ".".join(parts[:-1])
>             m = __import__(module)
>             for comp in parts[1:]:
>                 m = getattr(m, comp)
>             return m
>         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
>         # Generate a default new instance from the stage_name class.
>         py_type = __get_class(stage_name)
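
One possible direction for a fix is to let 3rd party packages register an explicit JVM-to-Python class mapping that is consulted before the prefix rewrite. This is only a sketch of that idea, not a patch from this ticket; register_stage, resolve_stage_class, and the class paths used below are hypothetical names.

```python
# Hypothetical registry mapping JVM class names to Python class paths.
_STAGE_REGISTRY = {}

def register_stage(java_class_name, python_class_path):
    """A 3rd party package would call this once at import time."""
    _STAGE_REGISTRY[java_class_name] = python_class_path

def resolve_stage_class(java_class_name):
    # Prefer an explicit registration; fall back to the prefix rewrite
    # used today, which is only correct for classes shipped with Spark.
    if java_class_name in _STAGE_REGISTRY:
        return _STAGE_REGISTRY[java_class_name]
    return java_class_name.replace("org.apache.spark", "pyspark")

# A 3rd party stage registers its Python counterpart explicitly
# ("abc_xyz.ml.TestClass" is an invented example path).
register_stage("com.abc.xyz.ml.testclass", "abc_xyz.ml.TestClass")

resolve_stage_class("com.abc.xyz.ml.testclass")
# -> "abc_xyz.ml.TestClass"
resolve_stage_class("org.apache.spark.ml.feature.Binarizer")
# -> "pyspark.ml.feature.Binarizer"
```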
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)