[
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193392#comment-16193392
]
Bago Amirbekian commented on SPARK-21926:
-----------------------------------------
[~mslipper] The trickiest thing about 1 (b) is knowing how to test that it
won't change behaviour. I'd like to run this past some folks with more MLlib
experience to see if there are any obvious issues with this approach that we
haven't considered.
> Some transformers in spark.ml.feature fail when trying to transform streaming
> dataframes
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
> Issue Type: Bug
> Components: ML, Structured Streaming
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming
> dataframes (for prediction). This ticket is meant to help aggregate these
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no
> metadata.
> Possible fixes:
> a) Re-design VectorUDT metadata to support missing metadata for some
> elements. (This might be a good thing to do anyway; see SPARK-19141.)
> b) drop metadata in streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.
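
To illustrate why fix 2 (b) would sidestep the metadata problem, here is a
minimal sketch in plain Python. It is not Spark code and not a proposed patch:
the helper name one_hot and its parameters are hypothetical, chosen only for
illustration. The idea is that once the cardinality is supplied by the user,
each row can be encoded on its own, without column metadata or a pass over the
data. The drop_last flag mirrors the dropLast behaviour of OneHotEncoder.

```python
from typing import List


def one_hot(index: int, cardinality: int, drop_last: bool = True) -> List[float]:
    """Encode a category index as a one-hot vector of fixed, user-supplied size.

    Hypothetical helper, not part of Spark. Because `cardinality` is given up
    front, the output size is known without inspecting the data or its metadata.
    """
    if not 0 <= index < cardinality:
        raise ValueError(f"index {index} out of range for cardinality {cardinality}")
    # With drop_last, the vector has one fewer slot and the last category
    # is represented by the all-zeros vector.
    size = cardinality - 1 if drop_last else cardinality
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec
```

For example, one_hot(1, 3) yields a length-2 vector with the second slot set,
while one_hot(2, 3) yields all zeros because the last category is dropped.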