[
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248049#comment-16248049
]
Bago Amirbekian commented on SPARK-22346:
-----------------------------------------
I think [~josephkb]'s version of Option 3 makes the most sense. A transformer
that adds size data to a vector column would allow patching pipelines pretty
easily, and it could be implemented without breaking any APIs.
I'm currently working on a PR based on this approach.
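To make the idea concrete, here is a minimal plain-Python sketch of the "size hint" transformer approach, under stated assumptions: all class and function names below are hypothetical illustrations, not Spark's actual API, and column metadata is modeled as a plain dict rather than an AttributeGroup.

```python
# Plain-Python sketch of the Option 3 idea: a transformer that stamps a
# known size into a vector column's metadata, so a downstream assembler
# can build its output metadata without inspecting any rows (which is
# exactly what a streaming query cannot do up front).
# All names are hypothetical; this is not Spark code.

class SizeHint:
    """Adds a declared vector size to a column's metadata."""

    def __init__(self, input_col, size):
        self.input_col = input_col
        self.size = size

    def transform(self, schema):
        # schema: dict mapping column name -> metadata dict
        meta = dict(schema.get(self.input_col, {}))
        declared = meta.get("size")
        if declared is not None and declared != self.size:
            raise ValueError(
                f"column {self.input_col!r} declares size {declared}, "
                f"but the hint says {self.size}")
        meta["size"] = self.size
        out = dict(schema)
        out[self.input_col] = meta
        return out


def assembled_output_size(schema, input_cols):
    """Toy stand-in for the assembler's metadata step: the output vector
    size is the sum of the input sizes, all of which must be known."""
    sizes = [schema.get(c, {}).get("size") for c in input_cols]
    missing = [c for c, s in zip(input_cols, sizes) if s is None]
    if missing:
        raise ValueError(f"unknown vector size for columns: {missing}")
    return sum(sizes)
```

With a schema like `{"a": {"size": 2}, "features": {}}`, the assembler step fails on the size-less "features" column; after `SizeHint("features", 3).transform(schema)` it succeeds and reports an output size of 5. The point of the design is that the hint patches an existing pipeline at the schema level only, without touching any other stage's API.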
> Update VectorAssembler to work with Structured Streaming
> --------------------------------------------------------
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
> Issue Type: Improvement
> Components: ML, Structured Streaming
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
> Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and
> assemble them into a new column of VectorType containing the concatenated
> vectors. In streaming mode, this transformation can fail because
> VectorAssembler does not have enough information to produce metadata
> (AttributeGroup) for the new column. Because VectorAssembler is such a
> ubiquitous part of MLlib pipelines, this issue effectively means Spark
> Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve
> VectorAssembler. Please let me know if there are any issues I have not
> considered or potential fixes I haven't outlined. I'm happy to submit a patch
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently
> done with OneHotEncoder,
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The
> Estimator can "learn" the size of the input vectors during training and save
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty
> major change, but it could be done in stages. We could first ensure that
> metadata is not used during prediction and allow the VectorAssembler to drop
> metadata for streaming dataframes. Going forward, it would be important to
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used
> during transform, only during fit) might be tricky. This would require
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most MLlib
> transformers that produce vectors already include the size of the vector in
> the column metadata. This change would be to deprecate APIs that allow
> creating a vector column of unknown length and replace those APIs with
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size; for example, VectorAssembler assumes
> the input and output columns are fixed-size vectors and creates metadata
> accordingly. In the spirit of "explicit is better than implicit," we would be
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.
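For comparison, Option 1's estimator/model split can also be sketched in a few lines of plain Python. Again, the names below are hypothetical stand-ins, not Spark's API: fit() scans training rows once to learn each input column's vector size, and the fitted model can then report its output size from the schema alone, which is what a streaming query needs before any data arrives.

```python
# Plain-Python sketch of Option 1: an estimator that learns input vector
# sizes during fit, paired with a model that transforms rows and exposes
# the output size without scanning data. Hypothetical names; not Spark code.

class VectorAssemblerEstimator:
    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, rows):
        # rows: iterable of dicts mapping column name -> list of floats
        sizes = {}
        for row in rows:
            for col in self.input_cols:
                n = len(row[col])
                if sizes.setdefault(col, n) != n:
                    raise ValueError(f"ragged vector sizes in column {col!r}")
        return VectorAssemblerModel(self.input_cols, sizes)


class VectorAssemblerModel:
    def __init__(self, input_cols, sizes):
        self.input_cols = input_cols
        self.sizes = sizes
        # Known at fit time, so streaming prediction can declare its
        # output metadata up front.
        self.output_size = sum(sizes[c] for c in input_cols)

    def transform(self, row):
        out = []
        for col in self.input_cols:
            vec = row[col]
            if len(vec) != self.sizes[col]:
                raise ValueError(f"size mismatch in column {col!r}")
            out.extend(vec)
        return out
```

The trade-off the ticket describes shows up directly: this pattern needs a fit step (and hence a new estimator class), whereas the Option 3 transformer only annotates the schema of an existing pipeline.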
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)