In a nutshell, what I'd like to do is instantiate a Pipeline (or an extension class of Pipeline) with metadata that is copied to the PipelineModel when fitted, persisted with it, and readable again when the fitted model is loaded by another consumer. These params are specific to the PipelineModel rather than to any particular Transformer or Estimator declared as part of the Pipeline. The intent is that a downstream consumer of the loaded model can read the PipelineModel params, but the values those params should take will only be known to the creator of the Pipeline/trainer of the PipelineModel.

It seems that Pipeline and PipelineModel support the Params interface, like Transformer and Estimator do. It seems I can extend Pipeline to a custom class MyPipeline, where the constructor could enforce that my metadata Params are set. However, when the Pipeline is *fit*, the resultant PipelineModel doesn't seem to include the original MyPipeline's params, only params from the individual Transformer steps.

From a read of the code, it seems that the *fit* method will copy over the Stages to the PipelineModel, and those will be persisted (along with the Stages' Params) during *write*, *but* any Params belonging to the Pipeline itself are not copied to the PipelineModel (as only Stages are considered during copy, not the ParamMap of the Pipeline) [1].

Is this a correct read of the flow here? That is, does a MyPipeline extension of Pipeline with member Params not get those non-Stage Params copied into the fitted PipelineModel? If so, would a feature enhancement that makes Pipeline-specific Params copyable into the fitted PipelineModel be considered acceptable? Or should there be another way to include metadata *about* the Pipeline, such that the metadata is copied to the fitted PipelineModel, persisted with PipelineModel *write*, and read again with PipelineModel *load*?

My first attempt at this has been to extend the Pipeline class itself with member Params, but this doesn't seem to do the trick, given that Params are only copied for Stages between the Pipeline and the fitted PipelineModel.

It occurs to me I could write a custom *withMetadata* transform Stage, which would really just be an identity function but with the desired Params built in, so that those Params would get copied along with the other Stages. But as discussed at the top, this particular use case for metadata isn't about any particular Transformer; it is about metadata for the whole Pipeline.
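
To make the workaround concrete, here is a minimal, untested sketch of such an identity Stage. The class name WithMetadata, the trainedBy param and the roundTrip helper are hypothetical names chosen for illustration, and I'm assuming that DefaultParamsWritable/DefaultParamsReadable are sufficient to persist a simple String param as part of a PipelineModel.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, Transformer}
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical identity stage: it changes no data and exists only to carry Params.
// Assumption: the class is compiled at top level so it can be found again on load.
class WithMetadata(override val uid: String)
    extends Transformer with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("withMetadata"))

  // Hypothetical metadata param; any simple Param would follow the same pattern.
  final val trainedBy: Param[String] =
    new Param[String](this, "trainedBy", "who trained this pipeline")

  def setTrainedBy(value: String): this.type = set(trainedBy, value)
  def getTrainedBy: String = $(trainedBy)

  // Identity: return the input rows and schema unchanged.
  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): WithMetadata = defaultCopy(extra)
}

// Companion object supplying the reader used when the stage is loaded.
object WithMetadata extends DefaultParamsReadable[WithMetadata]

object MetadataExample {
  // Hypothetical helper: fit, save, reload, and read the metadata back.
  def roundTrip(df: DataFrame, path: String): Option[String] = {
    val meta = new WithMetadata().setTrainedBy("alek")
    val model = new Pipeline().setStages(Array(meta)).fit(df)
    model.write.overwrite().save(path)

    val loaded = PipelineModel.load(path)
    loaded.stages.collectFirst { case m: WithMetadata => m.getTrainedBy }
  }
}
```

This works only because the Params live on a Stage, which is exactly the compromise I'd like to avoid: the consumer has to know to search the Stages for the metadata carrier instead of reading it from the PipelineModel itself.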

Alek

[1] -- https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L135