Joseph K. Bradley created SPARK-6915:
----------------------------------------
Summary: VectorIndexer improvements
Key: SPARK-6915
URL: https://issues.apache.org/jira/browse/SPARK-6915
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Priority: Minor
This covers several improvements to VectorIndexer. They could be handled
separately or in a single PR.
*Preserve metadata*
Currently, VectorIndexer does not preserve ML-specific input metadata in the
output column. If a feature is already marked as categorical or continuous, we
should preserve that metadata (rather than recomputing it). We should also check
that the input data is valid for that metadata.
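A rough sketch of the intended behavior, in plain Python rather than Spark code. The dict-based metadata layout and the names resolve_feature_metadata and max_categories are hypothetical, chosen only for illustration; the real implementation would work with the ML attribute metadata on the column.

```python
from typing import Dict, List, Optional


def resolve_feature_metadata(
    column: List[float],
    existing: Optional[Dict],
    max_categories: int,
) -> Dict:
    """Return metadata for one feature, preferring what the input already declares.

    `existing` is a hypothetical stand-in for input metadata, e.g.
    {"type": "categorical", "num_categories": 3} or {"type": "continuous"}.
    """
    if existing is not None:
        # Preserve the declared metadata, but check the data against it first.
        if existing["type"] == "categorical":
            num = existing["num_categories"]
            for v in column:
                if not float(v).is_integer() or not 0 <= v < num:
                    raise ValueError(
                        f"value {v} is invalid for a categorical feature "
                        f"with {num} categories"
                    )
        return existing

    # No input metadata: fall back to deciding from the data itself.
    distinct = sorted(set(column))
    if len(distinct) <= max_categories:
        return {
            "type": "categorical",
            "num_categories": len(distinct),
            "category_map": {v: i for i, v in enumerate(distinct)},
        }
    return {"type": "continuous"}
```

In this sketch a feature declared continuous is returned unchanged, so it is never re-examined or indexed.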
*Allow unknown categories*
Add an option for allowing unknown categories, probably via a parameter like
"allowUnknownCategories".
If true, handle unknown categories during transform by assigning them to an
extra category index.
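One possible reading of "extra category index", sketched in plain Python. It assumes known categories are indexed 0 to n-1, so the extra index is n; the function name index_value and the snake_case parameter are hypothetical.

```python
from typing import Dict


def index_value(
    value: float,
    category_map: Dict[float, int],
    allow_unknown_categories: bool,
) -> int:
    """Map a raw feature value to its category index.

    Assumes `category_map` (built during fit) uses indices 0..n-1.
    """
    if value in category_map:
        return category_map[value]
    if allow_unknown_categories:
        # All unseen values share one extra index, one past the known ones.
        return len(category_map)
    raise ValueError(f"unknown category value: {value}")
```

Under this assumption, a feature with n known categories would have to be reported as having n + 1 categories in the output metadata whenever the option is enabled.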
*Index particular features*
Add an option for limiting indexing to particular features.
This could be specified by an option, or we could handle it via the "Preserve
metadata" task above, where users would denote features as continuous in order
to have VectorIndexer ignore them.
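A sketch of the first variant (an explicit option), in plain Python. The parameter name features_to_index and the function transform_vector are hypothetical; category_maps stands in for whatever per-feature mappings the fitted model holds.

```python
from typing import Dict, List, Optional, Set


def transform_vector(
    vector: List[float],
    category_maps: Dict[int, Dict[float, int]],
    features_to_index: Optional[Set[int]] = None,
) -> List[float]:
    """Re-index categorical features, optionally restricted to a subset.

    `category_maps` maps feature index -> (raw value -> category index).
    If `features_to_index` is None, every feature in `category_maps` is indexed.
    """
    out = list(vector)
    for i, mapping in category_maps.items():
        if features_to_index is not None and i not in features_to_index:
            continue  # left untouched, i.e. treated as continuous
        out[i] = float(mapping[vector[i]])
    return out
```

For example, with maps for features 0 and 2, passing features_to_index={0} re-indexes only feature 0 and passes feature 2 through unchanged. The metadata-based variant would reach the same result by having the user mark feature 2 as continuous up front.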
*Performance optimizations*
See the TODO items within VectorIndexer.scala.