Saif Addin created SPARK-21556:
----------------------------------
Summary: PySpark, Unable to save pipeline of non-spark transformers
Key: SPARK-21556
URL: https://issues.apache.org/jira/browse/SPARK-21556
Project: Spark
Issue Type: Bug
Components: ML, PySpark
Affects Versions: 2.1.1
Reporter: Saif Addin
Priority: Minor
We are creating some new ML transformers that follow the same Spark /
PySpark design pattern.
In PySpark, however, we are unable to deserialize (read) Pipelines made of
such new Transformers, because of a hardcoded class path name in *wrapper.py*:
https://github.com/apache/spark/blob/master/python/pyspark/ml/wrapper.py#L200
That line makes pipeline components work only when the Python class has the
same fully qualified name as the JVM class with the root package replaced. It
does not work for more general use cases.
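
To illustrate the limitation, here is a minimal sketch of the name mapping as I understand it from wrapper.py. The helper names are mine, not Spark's, and the `com.frh.annotators.MyTransformer` class name is hypothetical (only the `com.frh` prefix appears in the error further down). It assumes the root replaced is `org.apache.spark` and the replacement is `pyspark`.

```python
def python_name_for(java_class_name: str) -> str:
    """Map a JVM class name to the Python name the loader would look for.

    Assumption: only the Spark root package is rewritten; every other
    package prefix is passed through unchanged.
    """
    return java_class_name.replace("org.apache.spark", "pyspark")


def split_module_and_class(qualified_name: str):
    """Split 'a.b.C' into the module to import ('a.b') and the class ('C')."""
    module_name, _, class_name = qualified_name.rpartition(".")
    return module_name, class_name


# A built-in Spark stage maps onto an importable pyspark module.
spark_stage = python_name_for("org.apache.spark.ml.feature.Tokenizer")

# A third-party stage (hypothetical name) keeps its JVM package, so the
# loader would try to import a Python module named after that JVM package.
custom_stage = python_name_for("com.frh.annotators.MyTransformer")
custom_module, custom_class = split_module_and_class(custom_stage)
```

With this mapping, a custom stage loads only if a Python module with the JVM package name happens to exist.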
The first workaround that comes to mind is to use the same package path on
the PySpark side as on the JVM side.
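
A rough sketch of that workaround, showing only the name resolution and not a real Transformer: a Python package tree that mirrors the JVM package makes the lookup succeed. The class name `com.example.nlp.MyTransformer` and both helper functions are hypothetical, and a real stage would still need to be a proper PySpark wrapper class.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write_mirror_module(root: Path, jvm_class: str) -> None:
    """Create Python packages under root that mirror the JVM package path."""
    *pkg_parts, cls_name = jvm_class.split(".")
    pkg_dir = root
    # Every JVM package level except the last becomes a Python package.
    for part in pkg_parts[:-1]:
        pkg_dir = pkg_dir / part
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
    # The last JVM package level becomes the module holding the class.
    module_file = pkg_dir / (pkg_parts[-1] + ".py")
    module_file.write_text(f"class {cls_name}:\n    pass\n")


def get_class(qualified_name: str):
    """Import the module part of the name and return the named class."""
    module_name, _, cls_name = qualified_name.rpartition(".")
    return getattr(importlib.import_module(module_name), cls_name)


JVM_CLASS = "com.example.nlp.MyTransformer"  # hypothetical JVM class name

tmp_root = tempfile.mkdtemp()
write_mirror_module(Path(tmp_root), JVM_CLASS)
sys.path.insert(0, tmp_root)
importlib.invalidate_caches()

resolved = get_class(JVM_CLASS)
```

The drawback is that the Python package layout is then dictated by the JVM package names.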
The error when trying to load a Pipeline from a path in such circumstances is:
{code:java}
E
======================================================================
ERROR: runTest (test.annotators.PipelineTestSpec)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/saif/IdeaProjects/this_project/test/annotators.py", line 208, in runTest
    loaded_pipeline = Pipeline.read().load(pipe_path)
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/util.py", line 198, in load
    return self._clazz._from_java(java_obj)
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 155, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 155, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 173, in _from_java
    py_type = __get_class(stage_name)
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 167, in __get_class
    m = __import__(module)
ModuleNotFoundError: No module named 'com.frh'
{code}