In a nutshell, what I'd like to do is instantiate a Pipeline (or an
extension class of Pipeline) with metadata that is copied to the
PipelineModel when fitted, and can be read again when the fitted model is
loaded by another consumer. These params are specific to the PipelineModel
more than any particular Transformer or the Estimator declared as part of
the Pipeline. The intent is that the PipelineModel params can be read by
a downstream consumer of the loaded model, but the value that the params
should take will only be known to the creator of the Pipeline/trainer of
the PipelineModel.

It seems that Pipeline and PipelineModel support the Params interface, like
Transformer and Estimator do. It seems I can extend Pipeline to a custom
class CustomPipeline, where the constructor could enforce that my metadata
Params are set. However, when the Pipeline is *fit*, the resultant
PipelineModel doesn't seem to include the original CustomPipeline's params,
only params from the individual Transformer steps.

From a read of the code, it seems that the *fit* method will copy over the
Stages to the PipelineModel, and those will be persisted (along with the
Stages' Params) during *write*, *but* any Params belonging to the Pipeline
are not copied to the PipelineModel (as only Stages are considered during
copy, not the ParamMap of the Pipeline) [1].
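
A toy model of that reading, in plain Python rather than Spark, may make the
question easier to check. Every name here (ToyStage, ToyPipeline,
ToyPipelineModel, the "trainedBy" key) is hypothetical, and the *fit* body
only mirrors the behaviour described above; it is not Spark's implementation.

```python
# Toy model in plain Python, NOT Spark code; all names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyStage:
    name: str
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToyPipelineModel:
    stages: List[ToyStage]
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToyPipeline:
    stages: List[ToyStage]
    params: Dict[str, str] = field(default_factory=dict)

    def fit(self) -> ToyPipelineModel:
        # Only the stages (each with its own params) are handed to the model.
        # self.params is never consulted, which is the gap being asked about.
        fitted = [ToyStage(s.name, dict(s.params)) for s in self.stages]
        return ToyPipelineModel(stages=fitted)


pipeline = ToyPipeline(
    stages=[ToyStage("tokenizer", {"inputCol": "text"})],
    params={"trainedBy": "alek"},  # pipeline-level metadata (illustrative)
)
model = pipeline.fit()
print(model.stages[0].params)  # stage-level params survive the fit
print(model.params)            # pipeline-level params are absent
```

If this matches what the real *fit* does, the enhancement asked about below
would amount to also carrying the Pipeline's own params across at that step.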

Is this a correct read of the flow here? That a CustomPipeline extension of
Pipeline with member Params does not get those non-Transform Params copied
into the fitted PipelineModel?

If so, would a feature enhancement including Pipeline-specific Params being
copyable into the fitted PipelineModel be considered acceptable?

Or should there be another way to include metadata *about* the Pipeline
such that the metadata is copyable to the fitted PipelineModel, and able to
be persisted with PipelineModel *write* and read again with PipelineModel
*load*? My first attempt at this has been to extend the Pipeline class
itself with member params, but this doesn't seem to do the trick given how
Params are actually copied only for Stages between Pipeline and the fitted
PipelineModel.

It occurs to me I could write a custom *withMetadata* transform Stage which
would really just be an identity function but with the desired Params built
in, and that those Params would get copied with the other Stages, but as
discussed at the top, this particular use-case for metadata isn't about any
particular Transform, but more about metadata for the whole Pipeline.
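
For illustration only, the workaround can be sketched as another plain-Python
toy, again not Spark code. ToyWithMetadata, ToyUppercase, run_stages and
find_metadata are hypothetical names, and the metadata values are made up.

```python
# Toy sketch in plain Python, NOT Spark code; all names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Optional

Row = Dict[str, str]


@dataclass
class ToyUppercase:
    """Stands in for an ordinary stage that changes the data."""
    column: str

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.column: r[self.column].upper()} for r in rows]


@dataclass
class ToyWithMetadata:
    """Identity stage: carries metadata and leaves the data untouched."""
    metadata: Dict[str, str]

    def transform(self, rows: List[Row]) -> List[Row]:
        return rows


def run_stages(stages: list, rows: List[Row]) -> List[Row]:
    # Apply each stage in order, as a fitted pipeline would.
    for stage in stages:
        rows = stage.transform(rows)
    return rows


def find_metadata(stages: list) -> Optional[Dict[str, str]]:
    # A downstream consumer has to scan the stages to locate the metadata.
    for stage in stages:
        if isinstance(stage, ToyWithMetadata):
            return stage.metadata
    return None


fitted_stages = [ToyWithMetadata({"trainedBy": "alek"}), ToyUppercase("text")]
output = run_stages(fitted_stages, [{"text": "hello"}])
print(output)
print(find_metadata(fitted_stages))
```

The sketch also shows the awkward part: the consumer must know to look for a
particular stage, even though the metadata describes the Pipeline as a whole.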

Alek

[1] --
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L135
