[
https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537321#comment-14537321
]
ASF GitHub Bot commented on FLINK-1735:
---------------------------------------
Github user aalexandrov commented on a diff in the pull request:
https://github.com/apache/flink/pull/665#discussion_r30005092
--- Diff:
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/extraction/FeatureHasher.scala
---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.feature.extraction
+
+import java.nio.charset.Charset
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.feature.extraction.FeatureHasher.{NonNegative,
NumFeatures}
+import org.apache.flink.ml.math.{Vector, SparseVector}
+
+import scala.util.hashing.MurmurHash3
+
+
+/** This transformer turns sequences of symbolic feature names (strings)
into
+ * flink.ml.math.SparseVectors, using a hash function to compute the
matrix column corresponding
+ * to a name. Aka the hashing trick.
+ * The hash function employed is the signed 32-bit version of Murmurhash3.
+ *
+ * By default for [[FeatureHasher]] transformer numFeatures=2^20 and
nonNegative=false.
+ *
+ * This transformer takes a [[Seq]] of strings and maps it to a
+ * feature [[Vector]].
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.common.Learner]] implementations which expect an
input of
+ * [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Seq[String]] =
env.fromCollection(data)
+ * val transformer =
FeatureHasher().setNumFeatures(65536).setNonNegative(false)
+ *
+ * transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[FeatureHasher.NumFeatures]]: The number of features (entries) in
the output vector;
+ * by default equal to 2^20
+ * - [[FeatureHasher.NonNegative]]: Whether output vector should contain
non-negative values only.
+ * When True, output values can be interpreted as frequencies. When
False, output values will have
+ * expected value zero; by default equal to false
+ */
+class FeatureHasher extends Transformer[Seq[String], Vector] with
Serializable {
+
+ // The seed used to initialize the hasher
+ val Seed = 0
+
+ /** Sets the number of features (entries) in the output vector
+ *
+ * @param numFeatures the user-specified numFeatures value. In case the
user gives a value less
+ * than 1, numFeatures is set to its default value:
2^20
+ * @return the FeatureHasher instance with its numFeatures value set to
the user-specified value
+ */
+ def setNumFeatures(numFeatures: Int): FeatureHasher = {
+ // number of features must be greater zero
+ if(numFeatures < 1) {
+ return this
--- End diff --
Silently ignoring an invalid value here might cause a small debugging hell:
the caller never learns that the setting was not applied. Throw a
`RuntimeException` or at least log a `WARN` message here.
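
A minimal sketch of the fail-fast variant, for illustration only. `FeatureHasherSketch` is a hypothetical, Flink-free stand-in that keeps the value in a plain field rather than in the `ParameterMap` used by the real class; Scala's `require` throws an `IllegalArgumentException`, which is a `RuntimeException`.

```scala
// Hypothetical, simplified stand-in for the FeatureHasher in the diff.
// It does not depend on Flink and only models the numFeatures setter.
class FeatureHasherSketch {
  // Default mirrors the documented default of 2^20.
  private var numFeatures: Int = 1 << 20

  def getNumFeatures: Int = numFeatures

  /** Rejects invalid values instead of silently keeping the previous one. */
  def setNumFeatures(value: Int): FeatureHasherSketch = {
    // require throws IllegalArgumentException when the condition is false.
    require(value >= 1, s"numFeatures must be greater than zero, but was $value")
    numFeatures = value
    this
  }
}
```

With this shape, `new FeatureHasherSketch().setNumFeatures(0)` fails at the call site instead of silently keeping the default.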
> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
> Key: FLINK-1735
> URL: https://issues.apache.org/jira/browse/FLINK-1735
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Felix Neutatz
> Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature
> values. The hash of the feature value is used to calculate its index for a
> vector entry. In order to mitigate possible collisions, a second hashing
> function is used to calculate the sign for the update value which is added to
> the vector entry. This way, it is likely that collisions will simply cancel
> out.
> A feature hasher would also be helpful for NLP problems where it could be
> used to vectorize bag of words or ngrams feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2]
> [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
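
The hashing trick described in the issue can be sketched roughly as follows. This is not the code from the pull request: `HashingTrickSketch`, its two seeds, and the dense `Array[Double]` output are assumptions made to keep the example self-contained (the pull request targets `SparseVector`). It uses `MurmurHash3.stringHash` from the Scala standard library with two different seeds to stand in for the two hash functions.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical illustration of the hashing trick with a separate sign hash.
object HashingTrickSketch {
  // Assumed seeds: two different seeds act as two different hash functions.
  val IndexSeed = 0
  val SignSeed = 1

  /** Maps feature names to a dense vector of length numFeatures. */
  def hash(features: Seq[String], numFeatures: Int): Array[Double] = {
    require(numFeatures >= 1, s"numFeatures must be greater than zero, but was $numFeatures")
    val vector = new Array[Double](numFeatures)
    for (feature <- features) {
      // floorMod keeps the index non-negative even for negative hash values.
      val index = Math.floorMod(MurmurHash3.stringHash(feature, IndexSeed), numFeatures)
      // The lowest bit of the second hash decides the sign of the update.
      val sign = if ((MurmurHash3.stringHash(feature, SignSeed) & 1) == 0) 1.0 else -1.0
      vector(index) += sign
    }
    vector
  }
}
```

For example, `HashingTrickSketch.hash(Seq("flink", "ml", "flink"), 16)` returns a 16-entry vector in which both occurrences of "flink" update the same entry with the same sign, while two different names that collide on one entry may carry opposite signs and cancel.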