Nick Pentreath created SPARK-13969:
--------------------------------------
Summary: Extend input format that feature hashing can handle
Key: SPARK-13969
URL: https://issues.apache.org/jira/browse/SPARK-13969
Project: Spark
Issue Type: Sub-task
Components: ML, MLlib
Reporter: Nick Pentreath
Priority: Minor
Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in
scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of
strings and computes term frequencies.
The use cases for feature hashing extend to arbitrary feature values (binary,
count or real-valued). For example, scikit-learn's {{FeatureHasher}} can accept
a sequence of (feature_name, value) pairs (e.g. a map or a list). In this way,
feature hashing can operate as both "one-hot encoder" and "vector assembler" at
the same time.
Investigate adding a more generic feature hasher (that in turn can be used by
{{HashingTF}}).
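
As a rough illustration of the idea (not a proposed Spark API), the sketch
below hashes (feature_name, value) pairs into a fixed-size sparse vector.
Everything in it is an assumption made for the example: the names
{{hash_features}} and {{hash_terms}}, the "name=value" key used for string
values, and MD5 as a stand-in hash function chosen only because it ships with
the Python standard library. It also omits refinements such as a signed hash.

```python
import hashlib
from typing import Dict, Iterable, Tuple, Union

Value = Union[str, int, float]


def _bucket(key: str, num_features: int) -> int:
    # Stand-in hash (assumption): MD5 is used here only for determinism.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_features(pairs: Iterable[Tuple[str, Value]],
                  num_features: int = 1024) -> Dict[int, float]:
    """Hash (feature_name, value) pairs into a sparse {index: value} vector."""
    vec: Dict[int, float] = {}
    for name, value in pairs:
        if isinstance(value, str):
            # Categorical value: hash "name=value" with weight 1 (one-hot style).
            key, weight = f"{name}={value}", 1.0
        else:
            # Binary, count or real value: hash the name, keep the value.
            key, weight = name, float(value)
        idx = _bucket(key, num_features)
        # Colliding features simply add up in the same bucket.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


def hash_terms(terms: Iterable[str], num_features: int = 1024) -> Dict[int, float]:
    """Term frequencies as a special case: each term is a (term, 1) pair."""
    return hash_features(((t, 1) for t in terms), num_features)


row = [("country", "US"), ("clicks", 3), ("price", 9.5)]
features = hash_features(row, num_features=64)
tf = hash_terms(["spark", "ml", "spark"], num_features=64)
```

Here a single pass over the row both encodes the categorical {{country}}
value and assembles the numeric columns into one vector, while
{{hash_terms}} shows how term-frequency hashing could sit on top of the
generic function.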