Nick Pentreath created SPARK-13969:
--------------------------------------

             Summary: Extend input format that feature hashing can handle
                 Key: SPARK-13969
                 URL: https://issues.apache.org/jira/browse/SPARK-13969
             Project: Spark
          Issue Type: Sub-task
          Components: ML, MLlib
            Reporter: Nick Pentreath
            Priority: Minor


Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
strings and computes term frequencies.

The use cases for feature hashing extend to arbitrary feature values (binary, 
count or real-valued). For example, scikit-learn's {{FeatureHasher}} can accept 
a sequence of (feature_name, value) pairs (e.g. a map or a list of pairs). In 
this way, feature hashing can operate as both a "one-hot encoder" and a "vector 
assembler" at the same time.

Investigate adding a more generic feature hasher (that in turn can be used by 
{{HashingTF}}).
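
As a rough illustration of the idea above, here is a minimal pure-Python sketch 
of a generic feature hasher over (feature_name, value) pairs. It is not Spark or 
scikit-learn code: the names {{hash_features}} and {{term_frequencies}}, the 
default of 16 buckets, and the use of CRC32 as the hash function are all 
assumptions made for the example (a real implementation would more likely use 
MurmurHash3 and a sparse vector). String values are hashed as "name=value" with 
weight 1.0, which gives the one-hot behaviour; numeric values are added under 
the hashed feature name, which gives the vector-assembler behaviour.

```python
import zlib


def hash_features(pairs, num_features=16):
    """Hash (feature_name, value) pairs into a fixed-length dense vector.

    Hypothetical helper for illustration only; CRC32 is an assumed stand-in
    for whatever hash function a real implementation would use.
    """
    vec = [0.0] * num_features
    for name, value in pairs:
        if isinstance(value, str):
            # Categorical value: hash "name=value" and add 1.0 (one-hot style).
            key, weight = f"{name}={value}", 1.0
        else:
            # Binary, count or real value: hash the name, add the value itself.
            key, weight = name, float(value)
        index = zlib.crc32(key.encode("utf-8")) % num_features
        vec[index] += weight  # colliding features simply sum in one bucket
    return vec


def term_frequencies(terms, num_features=16):
    """Term-frequency hashing expressed via the generic hasher.

    Each term is treated as the pair (term, 1), so repeated terms accumulate.
    """
    return hash_features(((term, 1) for term in terms), num_features)


row = {"city": "Paris", "clicks": 3, "price": 9.5}
vector = hash_features(row.items(), num_features=16)
tf = term_frequencies(["spark", "ml", "spark"], num_features=8)
```

The second helper shows how a term-frequency transformer could be layered on 
top of the generic hasher, under the assumption that each term contributes a 
count of 1.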



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
