Hi folks,

I've been giving a bit of thought to trying to improve ML exporting in
Spark to support a wider variety of formats. If you implement pipeline
stages, or you've added your own export logic, I'd especially love your
input.

A quick little draft of what I've been thinking about (after jumping back
into my ancient PR #9207) is as follows:

# Background

The current Spark ML writer only supports a Spark "internal" format. This
is less than ideal since Spark MLlib supports PMML, and more formats exist.
The goal of this design document is to allow more general support of saving
Spark ML pipeline stages and models.

Additionally, Spark ML has a growing ecosystem of pipeline stages outside
of core Spark, so any design should be usable by third-party pipeline
stages.

# Design sketch

Spark's DataFrameWriter interface provides a starting point for this
design. When writing, the user will be able to specify a path, general
options passed to the format, and, importantly, the format itself.
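
To make the intended call shape a bit more concrete, here is a rough sketch
of a DataFrameWriter-style builder. Everything in it (SketchWriter,
WriteRequest, the "internal" default name, the example option and path) is
made up for illustration and is not existing Spark API; save() just returns
a description of the request instead of writing anything.

```scala
// Hypothetical sketch: none of these names come from the Spark code base.
final case class WriteRequest(format: String, options: Map[String, String], path: String)

// A builder in the DataFrameWriter style: pick a format, add options, then save(path).
final case class SketchWriter(
    fmt: String = "internal", // assumed name for the default Spark format
    opts: Map[String, String] = Map.empty[String, String]) {

  // Select the export format by name.
  def format(name: String): SketchWriter = copy(fmt = name)

  // Add a general option that is passed through to the chosen format.
  def option(key: String, value: String): SketchWriter = copy(opts = opts + (key -> value))

  // A real writer would write to `path`; this sketch only records the request.
  def save(path: String): WriteRequest = WriteRequest(fmt, opts, path)
}

object WriterDemo {
  // The option key/value and the path are placeholders.
  val request: WriteRequest =
    SketchWriter().format("pmml").option("exampleOption", "exampleValue").save("/tmp/my-model")
}
```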

Format discovery will be accomplished in a similar manner to Spark
Datasources (Java's ServiceLoader). However, since individual model
providers may wish to implement their own version of a Spark-supported
format, the writer will be looked up by "formatname+pipelinestageclassname".

This has the downside of making the code harder to trace through than the
current structure, but opens up the possibility of allowing folks to
provide model export in additional formats not supported inside of the
models themselves.
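
As a rough illustration of the lookup described above, the sketch below
loads providers with Java's ServiceLoader and resolves them by a
"formatname+pipelinestageclassname" key. The trait, the registry object and
the com.example class names are all hypothetical; with no
META-INF/services registration on the classpath, the ServiceLoader call
simply finds nothing, so the demo adds one provider by hand.

```scala
import java.util.ServiceLoader

// Hypothetical provider interface; the names are illustrative, not Spark's API.
trait SketchWriterFormat {
  def formatName: String     // for example "pmml"
  def stageClassName: String // fully qualified class name of the stage this writer handles
  def write(path: String, options: Map[String, String]): String
}

object SketchFormatRegistry {
  // The lookup key from the design: format name plus pipeline stage class name.
  def key(formatName: String, stageClassName: String): String =
    s"$formatName+$stageClassName"

  // Providers registered through META-INF/services; empty if none are on the classpath.
  def discovered(): List[SketchWriterFormat] = {
    val it = ServiceLoader.load(classOf[SketchWriterFormat]).iterator()
    val found = List.newBuilder[SketchWriterFormat]
    while (it.hasNext) found += it.next()
    found.result()
  }

  // Resolve the writer for a given format and stage class, if one exists.
  def lookup(
      providers: Seq[SketchWriterFormat],
      formatName: String,
      stageClassName: String): Option[SketchWriterFormat] = {
    val byKey = providers.map(p => key(p.formatName, p.stageClassName) -> p).toMap
    byKey.get(key(formatName, stageClassName))
  }
}

object RegistryDemo {
  // A made-up third-party writer for a made-up stage class.
  object PmmlForMyStage extends SketchWriterFormat {
    val formatName = "pmml"
    val stageClassName = "com.example.MyStage"
    def write(path: String, options: Map[String, String]): String = s"pmml:$path"
  }

  val providers: List[SketchWriterFormat] = SketchFormatRegistry.discovered() :+ PmmlForMyStage
  val hit = SketchFormatRegistry.lookup(providers, "pmml", "com.example.MyStage")
  val miss = SketchFormatRegistry.lookup(providers, "pmml", "com.example.OtherStage")
}
```

Keying on both parts is what lets a third party ship, say, a PMML writer
for its own stage without touching the stage's class.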

# Migration path

External pipeline stages may already implement the current MLWriter. To
allow these to continue to work, a GeneralMLWriter will be created as a
parent class of the current MLWriter, and it will handle delegation for
other formats as described above.

For existing stages, the MLWriter's save function will be changed to check
that its input format is the default and, if so, delegate to the current
saveImpl.
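
Roughly, the delegation could look like the sketch below. The class names
echo the ones in this email, but the Sketch-prefixed types, the "internal"
default name, the signatures and the String return values are assumptions
made so the example stands alone; the real save would write files rather
than return a message.

```scala
// Hypothetical sketch: bodies and signatures are assumptions, not Spark code.
object SketchFormats {
  val Default = "internal" // assumed name for the current Spark format
}

// Stand-in for the proposed GeneralMLWriter parent class.
abstract class SketchGeneralMLWriter {
  protected var source: String = SketchFormats.Default

  // Select the export format by name.
  def format(name: String): this.type = { source = name; this }

  // Non-default formats would go through format discovery; this sketch just fails.
  protected def saveOther(format: String, path: String): String =
    throw new UnsupportedOperationException(s"No writer found for format '$format'")

  def save(path: String): String
}

// Stand-in for the existing MLWriter: subclasses keep implementing only saveImpl.
abstract class SketchMLWriter extends SketchGeneralMLWriter {
  protected def saveImpl(path: String): String

  // Default format goes to the existing saveImpl; anything else is delegated.
  override def save(path: String): String =
    if (source == SketchFormats.Default) saveImpl(path)
    else saveOther(source, path)
}

object MigrationDemo {
  // An existing third-party writer that only knows the current saveImpl contract.
  class LegacyWriter extends SketchMLWriter {
    protected def saveImpl(path: String): String = s"internal format written to $path"
  }

  val defaultResult: String = new LegacyWriter().save("/tmp/stage")

  // Asking for another format does not reach saveImpl.
  val otherFormatFails: Boolean =
    try { new LegacyWriter().format("pmml").save("/tmp/stage"); false }
    catch { case _: UnsupportedOperationException => true }
}
```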

We would then deprecate MLWriter in the next version and remove it in
Spark 3.

Does this sound reasonable to folks? It would allow us to add PMML support
in Spark ML pipelines and open it up for other folks to fill in the gaps or
add other custom formats.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
