Hi folks, I've been giving a bit of thought to trying to improve ML exporting in Spark to support a wider variety of formats. If you implement pipeline stages, or you've added your own export logic, I'd especially love your input.
A quick little draft of what I've been thinking about (after jumping back into my ancient PR #9207) is as follows:

# Background

The current Spark ML writer only supports a Spark "internal" format. This is less than ideal, since Spark MLlib supports PMML and more formats exist. The goal of this design document is to allow more general support for saving Spark ML pipeline stages and models. Additionally, Spark ML has a growing ecosystem of pipeline stages outside of core Spark, so any design should be usable by 3rd party pipeline stages.

# Design sketch

Spark's DataFrameWriter interface provides a starting point for this design. When writing, the user will be able to specify a path, general options passed to the format, and, importantly, the format itself. Format discovery will be accomplished in a similar manner to Spark Datasources (Java's ServiceLoader). However, since individual model providers may wish to implement their own version of a Spark-supported format, the writer will be looked up by "formatname+pipelinestageclassname".

This has the downside of making the code not necessarily as easy to trace through as the current structure, but it opens up the possibility of allowing folks to provide model export in additional formats not supported inside the model itself.

# Migration path

External pipeline stages may already implement the current MLWriter. To allow these to continue to work, a GeneralMLWriter will be created as a parent class of the current MLWriter, and it will handle delegation for other formats as described above. For existing stages, the MLWriter's save function will be changed to check whether its input format is the default and, if so, delegate to the current saveImpl. We would then deprecate MLWriter in the next version and remove it in Spark 3.

Does this sound reasonable to folks? It would allow us to add PMML support in Spark ML pipelines and open it up for other folks to fill in the gaps or add other custom formats.
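For illustration, here is a rough sketch of the lookup and delegation described above. It is plain Python rather than Spark code, and it is only meant to show the shape of the idea: the dictionary registry stands in for ServiceLoader-style discovery, and every name other than GeneralMLWriter, MLWriter and saveImpl (written `save_impl` here) is a hypothetical stand-in, including `register_writer`, `MyStage`, `MyStageWriter`, the "internal" default format name and the `version` option.

```python
# Illustrative sketch only: plain Python, not Spark's API.
# All helper names below are hypothetical.

# Stand-in for ServiceLoader-style discovery.
# Key: (format name, pipeline stage class name) -> writer function.
_FORMAT_WRITERS = {}


def register_writer(format_name, stage_class_name, writer_fn):
    """Register a writer for one (format, stage class) combination."""
    _FORMAT_WRITERS[(format_name.lower(), stage_class_name)] = writer_fn


class GeneralMLWriter:
    """Parent writer: collects format and options, then delegates by key."""

    DEFAULT_FORMAT = "internal"  # assumed name for the current Spark format

    def __init__(self, stage):
        self.stage = stage
        self.source = self.DEFAULT_FORMAT
        self.options = {}

    def format(self, source):
        self.source = source.lower()
        return self

    def option(self, key, value):
        self.options[key] = value
        return self

    def save(self, path):
        # Look up the writer by format name plus pipeline stage class name.
        key = (self.source, type(self.stage).__name__)
        writer_fn = _FORMAT_WRITERS.get(key)
        if writer_fn is None:
            raise ValueError(
                f"No writer registered for format {key[0]!r} and stage {key[1]!r}"
            )
        return writer_fn(self.stage, path, dict(self.options))


class MLWriter(GeneralMLWriter):
    """Existing-style writer: the default format still goes to save_impl."""

    def save_impl(self, path):
        raise NotImplementedError

    def save(self, path):
        if self.source == self.DEFAULT_FORMAT:
            return self.save_impl(path)
        # Any other format is delegated to the parent's registry lookup.
        return super().save(path)


class MyStage:
    """Hypothetical third-party pipeline stage."""

    coefficient = 2.0


class MyStageWriter(MLWriter):
    """Hypothetical existing writer that only knows the internal format."""

    def save_impl(self, path):
        return f"internal:{path}:{self.stage.coefficient}"


def my_stage_pmml_writer(stage, path, options):
    """Hypothetical PMML exporter supplied outside of the stage itself."""
    return f"pmml:{path}:{stage.coefficient}:{options.get('version', 'unset')}"


register_writer("pmml", "MyStage", my_stage_pmml_writer)
```

In this sketch, `MyStageWriter(MyStage()).save(path)` still reaches the stage's own `save_impl`, while `MyStageWriter(MyStage()).format("pmml").save(path)` is routed through the registry to a writer that the stage author never had to know about.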
Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau