[
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251202#comment-17251202
]
Derek Tapley edited comment on SPARK-24632 at 12/17/20, 4:55 PM:
-----------------------------------------------------------------
I've been running into this problem as well; it seems like it could be solved
in several ways.
# Provide a way to override the line
{code:python}
stage_name = java_stage.getClass().getName().replace(
    "org.apache.spark", "pyspark")
{code}
# Refactor `Pipeline` and `PipelineModel` to use:
{code:python}
py_stages = [s._from_java() for s in java_stage.stages()]
{code}
instead of
{code:python}
py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
{code}
Both approaches would likely require custom (meta) Transformers/Estimators to
override both `_to_java` and `_from_java`. Is there any preference? I can
open a PR for either method, though I'm leaning toward the latter as easier
to implement, provided it doesn't break existing functionality.
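For context, the fragility of that hardcoded replacement can be shown with a plain-Python sketch (the function name here is illustrative, not pyspark API): built-in Spark stages map cleanly to a `pyspark` class path, but a 3rd-party Java class name passes through unchanged, so the subsequent Python module lookup fails.

```python
def python_stage_name(java_class_name):
    # Mirrors the hardcoded mapping in JavaParams._from_java: only the
    # org.apache.spark prefix is rewritten, so anything outside that
    # namespace passes through unchanged.
    return java_class_name.replace("org.apache.spark", "pyspark")

# Built-in stage: maps to a real pyspark class path.
assert python_stage_name("org.apache.spark.ml.feature.Binarizer") == \
    "pyspark.ml.feature.Binarizer"

# Third-party stage: the replacement is a no-op, so the later module
# lookup fails -- the bug this ticket describes.
assert python_stage_name("com.example.ml.MyTransformer") == \
    "com.example.ml.MyTransformer"
```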
> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers
> for persistence
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 3.1.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and
> use Pipeline persistence. This task is to make it easier for 3rd-party
> libraries to have PipelineStages written in Java and then to use pyspark.ml
> abstractions to create wrappers around those Java classes. This is currently
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up my thoughts and a
> proposal in the doc linked below. Summary of the proposal:
> Require that 3rd-party libraries with Java classes with Python wrappers
> implement a trait which provides the corresponding Python classpath in some
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
>
> class MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.
> They would ideally test a Java class + Python wrapper class pair sitting
> outside of pyspark.
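As a rough illustration of how the proposed `pythonClassPath` field could be consumed on the Python side (a sketch only; the function name, parameters, and dispatch logic are hypothetical, not actual pyspark API): an explicit class path from the trait would take precedence, with the current string replacement kept as a fallback for built-in wrappers.

```python
import importlib


def resolve_python_stage(java_class_name, python_class_path=None):
    # Hypothetical dispatch: prefer an explicit pythonClassPath (the
    # proposed PythonWrappable trait), falling back to the existing
    # org.apache.spark -> pyspark replacement for built-in stages.
    stage_name = python_class_path or java_class_name.replace(
        "org.apache.spark", "pyspark")
    module_name, class_name = stage_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, a 3rd-party Java stage exposing `collections.OrderedDict` as its (stand-in) Python class path would resolve to that class, with no reliance on the `org.apache.spark` prefix.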
--
This message was sent by Atlassian Jira
(v8.3.4#803005)