[ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251518#comment-16251518
 ] 

Apache Spark commented on SPARK-22346:
--------------------------------------

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/19746

> Update VectorAssembler to work with Structured Streaming
> --------------------------------------------------------
>
>                 Key: SPARK-22346
>                 URL: https://issues.apache.org/jira/browse/SPARK-22346
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
>
> The issue:
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble them into a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of MLlib pipelines, this issue effectively means Spark 
> Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
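To make the failure mode concrete, here is a minimal pure-Python sketch (illustrative only, not the actual Spark API or internals) of why assembling vectors needs per-column sizes up front: the output column's metadata records a total size equal to the sum of the input sizes, which a batch job can infer by looking at the data but a streaming query cannot know before rows arrive.

```python
# Hypothetical sketch: vector assembly needs known input sizes to build
# output metadata. Not Spark code; names are illustrative.

def assemble(rows, input_sizes=None):
    """Concatenate per-row vectors from several columns into one vector.

    rows: list of tuples of vectors (one vector per input column).
    input_sizes: per-column vector sizes, needed to build output metadata.
    """
    if input_sizes is None:
        # Batch mode can peek at the first row to infer sizes;
        # a streaming query has no data to peek at, so it must fail here.
        raise ValueError("cannot build output metadata: input vector sizes unknown")
    output_size = sum(input_sizes)  # the size recorded in the column metadata
    assembled = []
    for row in rows:
        vec = [x for col in row for x in col]  # concatenate the input vectors
        assert len(vec) == output_size
        assembled.append(vec)
    return assembled, {"num_attrs": output_size}

rows = [([1.0, 2.0], [3.0]), ([4.0, 5.0], [6.0])]
vectors, metadata = assemble(rows, input_sizes=[2, 1])
```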
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair, as was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> estimator can "learn" the sizes of the input vectors during training and save 
> them to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate the current VectorAssembler
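The estimator/model split in fix 1 can be sketched as follows (class and method names here are illustrative, not the actual Spark classes): fit() records each input column's vector size from the training data, so the fitted model knows the output size before any streaming row arrives.

```python
# Hypothetical sketch of the estimator/model pair proposed in fix 1.
# Not the real Spark API; names are illustrative.

class VectorAssemblerEstimator:
    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, batch_rows):
        # "Learn" each input column's vector size from the training data,
        # as an estimator would during the batch fit step.
        first = batch_rows[0]
        sizes = {col: len(first[col]) for col in self.input_cols}
        return VectorAssemblerModel(self.input_cols, sizes)


class VectorAssemblerModel:
    def __init__(self, input_cols, sizes):
        self.input_cols = input_cols
        self.sizes = sizes
        # Output size is known up front, so output metadata can be built
        # even when transform() later runs on a stream.
        self.output_size = sum(sizes.values())

    def transform(self, row):
        # Concatenate the input column vectors in declared order.
        return [x for col in self.input_cols for x in row[col]]


est = VectorAssemblerEstimator(["a", "b"])
model = est.fit([{"a": [1.0, 2.0], "b": [3.0]}])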
> 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially an easy short-term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * The current Attributes implementation is also causing other issues, e.g. 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * Fully removing ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML Attributes (e.g. ensuring ML Attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or another enforcement mechanism to prevent regressions.
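The short-term variant of fix 2 could look roughly like this sketch (illustrative, not Spark internals): when the input is streaming, assemble the vectors but attach empty metadata instead of failing.

```python
# Hypothetical sketch of fix 2's short-term variant: drop ML Attributes
# for streaming inputs rather than fail. Not Spark code.

def assemble_with_optional_metadata(rows, is_streaming):
    vectors = [[x for col in row for x in col] for row in rows]
    if is_streaming:
        metadata = {}  # drop the metadata rather than fail on a stream
    else:
        metadata = {"num_attrs": len(vectors[0])} if vectors else {}
    return vectors, metadata

rows = [([1.0, 2.0], [3.0])]
```

The cost, as noted above, is that nothing downstream may rely on that metadata at prediction time.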
> 3) Require Vector columns to have fixed-length vectors. Most MLlib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size; for example, VectorAssembler assumes 
> the input and output columns are fixed-size vectors and creates metadata 
> accordingly. In the spirit of "explicit is better than implicit", we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.
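Fix 3 amounts to an API where the vector size must be declared when the column is created, so it is always present in the metadata. A minimal sketch under that assumption (illustrative names, not a real Spark API):

```python
# Hypothetical sketch of fix 3: a vector column whose size is fixed at
# creation time and enforced on every append. Not a real Spark API.

class FixedSizeVectorColumn:
    def __init__(self, size):
        # The size is declared up front, so it is always available as
        # metadata -- including to a streaming query.
        self.size = size

    def append(self, vec):
        if len(vec) != self.size:
            raise ValueError(f"expected vector of size {self.size}, got {len(vec)}")
        return list(vec)

col = FixedSizeVectorColumn(3)
```

Enforcing the size at write time is what would enable the performance optimizations mentioned above, since every reader could rely on a known, constant vector width.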



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
