[
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201257#comment-15201257
]
Nick Pentreath edited comment on SPARK-13969 at 3/18/16 10:00 AM:
------------------------------------------------------------------
What I have in mind is something like the following:
{code}
// class FeatureHasher extends Transformer ...
val df = sqlContext.createDataFrame(Seq(
(3.5, "foo", Seq("woo", "woo")),
(5.3, "bar", Seq("baz", "baz")))).toDF("real", "categorical", "raw_text")
df.show
// +----+-----------+----------+
// |real|categorical| raw_text|
// +----+-----------+----------+
// | 3.5| foo|[woo, woo]|
// | 5.3| bar|[baz, baz]|
// +----+-----------+----------+
val hasher = new FeatureHasher()
val result = hasher.transform(df)
result.show(false)
// +--------------------------+
// |features |
// +--------------------------+
// |(10,[3,5,6],[1.0,2.0,3.5])|
// |(10,[1,6,9],[1.0,5.3,2.0])|
// +--------------------------+
// numerical columns are handled by hashing the column name to get the vector index,
// and using the value as the feature value
// string columns are treated as categorical: the feature value (or "column_name=feature_value")
// is hashed to get the index, and the value is set to 1.0
// string sequence columns are handled the same way HashingTF does currently,
// i.e. the same as categorical but allowing for counts
{code}
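To make the column handling above concrete, here is a rough, self-contained sketch of the hashing scheme in plain Scala (not the proposed Transformer itself); the object/method names and the use of MurmurHash3 are only illustrative assumptions:
{code}
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object FeatureHasherSketch {

  // Map a term to a bucket in [0, numFeatures) using a non-negative mod of the hash
  private def indexOf(term: String, numFeatures: Int): Int = {
    val h = MurmurHash3.stringHash(term)
    ((h % numFeatures) + numFeatures) % numFeatures
  }

  // Hash one row of mixed-type columns into sparse (index -> value) pairs
  def hashRow(
      reals: Map[String, Double],        // numeric: index = hash(column name), value = feature value
      categoricals: Map[String, String], // string: index = hash("column=value"), value = 1.0
      terms: Map[String, Seq[String]],   // string seq: index = hash(term), value = term count
      numFeatures: Int): Map[Int, Double] = {
    val acc = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
    reals.foreach { case (col, v) => acc(indexOf(col, numFeatures)) += v }
    categoricals.foreach { case (col, v) => acc(indexOf(s"$col=$v", numFeatures)) += 1.0 }
    terms.values.foreach(_.foreach(t => acc(indexOf(t, numFeatures)) += 1.0))
    acc.toMap
  }

  def main(args: Array[String]): Unit = {
    // first row of the example DataFrame above
    val hashed = hashRow(
      reals = Map("real" -> 3.5),
      categoricals = Map("categorical" -> "foo"),
      terms = Map("raw_text" -> Seq("woo", "woo")),
      numFeatures = 10)
    // typically three active indices (1.0 for "foo", 2.0 for the two "woo" terms, 3.5 for "real");
    // the exact indices depend on the hash function
    println(hashed)
  }
}
{code}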
[~mengxr] [~josephkb] I'd like to get your thoughts on this.
> Extend input format that feature hashing can handle
> ---------------------------------------------------
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Nick Pentreath
> Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary,
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this
> way, feature hashing can operate as both "one-hot encoder" and "vector
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by
> {{HashingTF}}).
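For contrast, this is roughly how the existing {{HashingTF}} is used today, operating only on a single string-sequence column (the {{df}} and column names refer to the example above):
{code}
import org.apache.spark.ml.feature.HashingTF

// Today's HashingTF: one Seq[String] input column, term-frequency output only
val tf = new HashingTF()
  .setInputCol("raw_text")
  .setOutputCol("tf_features")
  .setNumFeatures(10)
tf.transform(df).select("tf_features").show(false)
// the "real" and "categorical" columns are ignored -- hence the proposal above
{code}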