Saif Addin created SPARK-21556:
----------------------------------

             Summary: PySpark, Unable to save pipeline of non-spark transformers
                 Key: SPARK-21556
                 URL: https://issues.apache.org/jira/browse/SPARK-21556
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 2.1.1
            Reporter: Saif Addin
            Priority: Minor


We are creating new ML transformers that follow the same Spark / PySpark 
design pattern.
In PySpark, though, we are unable to deserialize (read) Pipelines made of 
such new Transformers, because of a hardcoded class path name in *wrapper.py*:

https://github.com/apache/spark/blob/master/python/pyspark/ml/wrapper.py#L200

This line makes pipeline components loadable only when the JVM class name 
equals the Python class name with the root package replaced. It does not work 
for more general use cases, such as transformers that live outside the Spark 
package namespace.
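
To make the failure mode concrete, here is a minimal standalone sketch of the 
name mapping described above. It is not the actual wrapper.py code: the 
function names are made up for illustration, the root strings 
`org.apache.spark` and `pyspark` are an assumption about what "the root 
replaced" means, and `com.example.MyTransformer` is a hypothetical class name.

```python
import importlib


def jvm_to_python_name(jvm_class_name,
                       jvm_root="org.apache.spark",
                       py_root="pyspark"):
    """Derive a Python class path by swapping the root package (assumed roots)."""
    return jvm_class_name.replace(jvm_root, py_root)


def resolve_python_class(qualified_name):
    """Import the module part of a dotted name and return the named attribute."""
    module_name, _, class_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A Spark-namespaced class maps onto the pyspark package.
spark_name = jvm_to_python_name("org.apache.spark.ml.feature.Tokenizer")

# A class outside the Spark namespace (hypothetical name) is left unchanged,
# so the Python side would then try to import a module literally named "com...".
custom_name = jvm_to_python_name("com.example.MyTransformer")
```

With this mapping, a third-party JVM class name is passed through untouched, 
which is consistent with the import error reported in this issue.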

The first workaround that comes to mind is to use the same package path on 
the PySpark side as on the JVM side.
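
A rough sketch of why mirroring the path would help, under stated 
assumptions: the package `com.example.nlp` and the class `MyTransformer` are 
hypothetical, and the modules are registered in memory only to keep the 
example self-contained. A real project would instead ship actual packages 
(for example a `com/example/nlp.py` file) on the Python path.

```python
import importlib
import sys
import types


def register_mirror_module(dotted_module, **attrs):
    """Create in-memory packages for each prefix of dotted_module (demo only)."""
    parts = dotted_module.split(".")
    parent = None
    for i in range(1, len(parts) + 1):
        name = ".".join(parts[:i])
        mod = sys.modules.get(name)
        if mod is None:
            mod = types.ModuleType(name)
            mod.__path__ = []  # mark as a package so dotted imports resolve
            sys.modules[name] = mod
        if parent is not None:
            setattr(parent, parts[i - 1], mod)
        parent = mod
    # Attach the given attributes (e.g. classes) to the innermost module.
    for key, value in attrs.items():
        setattr(parent, key, value)
    return parent


class MyTransformer:
    """Hypothetical Python-side counterpart of a JVM transformer."""


def load_class(qualified_name):
    """Resolve a dotted class name the way a name-based loader would."""
    module_name, _, class_name = qualified_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


# Mirror the (hypothetical) JVM package path on the Python side.
register_mirror_module("com.example.nlp", MyTransformer=MyTransformer)
loaded = load_class("com.example.nlp.MyTransformer")
```

Once a Python module exists at the same dotted path as the JVM class, a 
purely name-based lookup can find it without any change to the mapping rule.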

The error when trying to load a Pipeline from a path in such circumstances is:

{code:java}

E
======================================================================
ERROR: runTest (test.annotators.PipelineTestSpec)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/saif/IdeaProjects/this_project/test/annotators.py", line 208, in runTest
    loaded_pipeline = Pipeline.read().load(pipe_path)
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/util.py", line 198, in load
    return self._clazz._from_java(java_obj)
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 155, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 155, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 173, in _from_java
    py_type = __get_class(stage_name)
  File "/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 167, in __get_class
    m = __import__(module)
    m = __import__(module)
ModuleNotFoundError: No module named 'com.frh'

{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
