[ https://issues.apache.org/jira/browse/SPARK-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-6915:
--------------------------------
Labels: bulk-closed (was: )
> VectorIndexer improvements
> --------------------------
>
> Key: SPARK-6915
> URL: https://issues.apache.org/jira/browse/SPARK-6915
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.0
> Reporter: Joseph K. Bradley
> Priority: Minor
> Labels: bulk-closed
>
> This covers several improvements to VectorIndexer. They could be handled
> separately or in a single PR.
> *Preserving metadata*
> Currently, VectorIndexer preserves non-ML metadata, unlike StringIndexer.
> We should change it so that it does not maintain non-ML metadata.
> Currently, it also does not preserve ML-specific input metadata in the output
> column. If a feature is already marked as categorical or continuous, we
> should preserve that metadata (rather than recomputing it). We should also
> check that the input data is valid for that metadata.
> *Allow unknown categories*
> Add an option for allowing unknown categories, probably via a parameter like
> "allowUnknownCategories".
> If true, handle unknown categories during transform by assigning them to
> an extra category index.
> *Index particular features*
> Add an option for limiting indexing to particular features.
> This could be specified by an option, or we could handle it via the
> "Preserving metadata" task above, where users would mark features as
> continuous in order to have VectorIndexer ignore them.
> *Performance optimizations*
> See the TODO items within VectorIndexer.scala
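
The "Allow unknown categories" and "Index particular features" proposals above can be illustrated with a minimal, Spark-free sketch. This is not VectorIndexer's implementation or API: the function names, the feature_subset argument, and the allow_unknown_categories flag (modeled on the suggested "allowUnknownCategories" parameter) are all hypothetical, and rows are plain Python lists rather than ML vectors.

```python
from typing import Dict, Iterable, List, Optional, Sequence

CategoryMaps = Dict[int, Dict[float, int]]


def fit_category_maps(
    rows: Sequence[Sequence[float]],
    max_categories: int,
    feature_subset: Optional[Iterable[int]] = None,  # hypothetical option
) -> CategoryMaps:
    """Learn a value -> index map for each feature treated as categorical.

    A feature is treated as categorical when it has at most
    `max_categories` distinct values. If `feature_subset` is given,
    only those feature positions are considered for indexing.
    """
    num_features = len(rows[0])
    if feature_subset is None:
        candidates = list(range(num_features))
    else:
        candidates = sorted(set(feature_subset))

    maps: CategoryMaps = {}
    for j in candidates:
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {value: index for index, value in enumerate(distinct)}
    return maps


def transform_row(
    row: Sequence[float],
    category_maps: CategoryMaps,
    allow_unknown_categories: bool = False,  # hypothetical flag
) -> List[float]:
    """Replace categorical feature values with their category indices.

    An unseen value is mapped to one extra index (equal to the number of
    known categories) when `allow_unknown_categories` is true; otherwise
    a ValueError is raised. Non-indexed features pass through unchanged.
    """
    out = list(row)
    for j, mapping in category_maps.items():
        value = row[j]
        if value in mapping:
            out[j] = float(mapping[value])
        elif allow_unknown_categories:
            out[j] = float(len(mapping))  # the extra "unknown" category
        else:
            raise ValueError(f"Unseen category {value!r} in feature {j}")
    return out
```

In this sketch all unknown values of a feature share one extra index, so a downstream model would need to expect one more category than was seen during fitting.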
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)