[
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201257#comment-15201257
]
Nick Pentreath edited comment on SPARK-13969 at 3/18/16 10:00 AM:
------------------------------------------------------------------
What I have in mind is something like the following:
{code}
// class FeatureHasher extends Transformer ...
val df = sqlContext.createDataFrame(Seq(
(3.5, "foo", Seq("woo", "woo")),
(5.3, "bar", Seq("baz", "baz")))).toDF("real", "categorical", "raw_text")
df.show
// +----+-----------+----------+
// |real|categorical| raw_text|
// +----+-----------+----------+
// | 3.5| foo|[woo, woo]|
// | 5.3| bar|[baz, baz]|
// +----+-----------+----------+
val hasher = new FeatureHasher()
val result = hasher.transform(df)
result.show(false)
// +--------------------------+
// |features |
// +--------------------------+
// |(10,[3,5,6],[1.0,2.0,3.5])|
// |(10,[1,6,9],[1.0,5.3,2.0])|
// +--------------------------+
// numerical columns are handled by hashing the column name to get the vector index,
// and using the value as the feature value
// string columns are treated as categorical: the feature value (or "column_name=feature_value")
// is hashed to get the index, and the value is set to 1.0
// string sequence columns are handled the same way HashingTF does currently,
// i.e. the same as categorical but allowing for counts
{code}
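To make the column handling above concrete, here is a rough, self-contained sketch of the hashing scheme in plain Scala (not the proposed Transformer itself); the object/method names and the use of MurmurHash3 are only illustrative assumptions:
{code}
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object FeatureHasherSketch {

  // Map a term to a bucket in [0, numFeatures) using a non-negative mod of the hash
  private def indexOf(term: String, numFeatures: Int): Int = {
    val h = MurmurHash3.stringHash(term)
    ((h % numFeatures) + numFeatures) % numFeatures
  }

  // Hash one row of mixed-type columns into sparse (index -> value) pairs
  def hashRow(
      reals: Map[String, Double],        // numeric: index = hash(column name), value = feature value
      categoricals: Map[String, String], // string: index = hash("column=value"), value = 1.0
      terms: Map[String, Seq[String]],   // string seq: index = hash(term), value = term count
      numFeatures: Int): Map[Int, Double] = {
    val acc = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
    reals.foreach { case (col, v) => acc(indexOf(col, numFeatures)) += v }
    categoricals.foreach { case (col, v) => acc(indexOf(s"$col=$v", numFeatures)) += 1.0 }
    terms.values.foreach(_.foreach(t => acc(indexOf(t, numFeatures)) += 1.0))
    acc.toMap
  }

  def main(args: Array[String]): Unit = {
    // first row of the example DataFrame above
    val hashed = hashRow(
      reals = Map("real" -> 3.5),
      categoricals = Map("categorical" -> "foo"),
      terms = Map("raw_text" -> Seq("woo", "woo")),
      numFeatures = 10)
    // typically three active indices (1.0 for "foo", 2.0 for the two "woo" terms, 3.5 for "real");
    // the exact indices depend on the hash function
    println(hashed)
  }
}
{code}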
[~mengxr] [~josephkb] I'd like to get your thoughts on this.
> Extend input format that feature hashing can handle
> ---------------------------------------------------
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Nick Pentreath
> Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary,
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this
> way, feature hashing can operate as both "one-hot encoder" and "vector
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by
> {{HashingTF}}).
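For contrast, this is roughly how the existing {{HashingTF}} is used today, operating only on a single string-sequence column (the {{df}} and column names refer to the example above):
{code}
import org.apache.spark.ml.feature.HashingTF

// Today's HashingTF: one Seq[String] input column, term-frequency output only
val tf = new HashingTF()
  .setInputCol("raw_text")
  .setOutputCol("tf_features")
  .setNumFeatures(10)
tf.transform(df).select("tf_features").show(false)
// the "real" and "categorical" columns are ignored -- hence the proposal above
{code}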