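The function computes hashed bigram frequencies: each two-character
bigram of the string is hashed into one of n = 1000 buckets, each bucket
count is normalized by the total number of bigrams, and the zero entries
are then filtered out to build a sparse vector. Hashing is an effective
trick for vectorizing features. Take a look at
http://en.wikipedia.org/wiki/Feature_hashing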
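
To make the steps concrete, here is a minimal Spark-free sketch of the same
idea. It is not the code from the demo: the object name BigramHashing, the
default n = 1000, and the Seq[(Int, Double)] return type (instead of an
MLlib sparse vector) are assumptions made so it runs without Spark.

```scala
// Sketch only: hashed, normalized bigram frequencies without Spark.
// BigramHashing and the (index, value) return type are illustrative
// assumptions, not part of the original demo.
object BigramHashing {
  def featurize(s: String, n: Int = 1000): Seq[(Int, Double)] = {
    // All overlapping two-character windows of the input string.
    val bigrams = s.sliding(2).toArray
    val counts = new Array[Double](n)

    // Hash each bigram into one of n buckets and count the hits.
    // floorMod keeps the index non-negative for any hash code.
    for (b <- bigrams) {
      counts(java.lang.Math.floorMod(b.hashCode, n)) += 1.0
    }

    // Normalize by the number of bigrams and keep only non-empty buckets,
    // which mirrors the zero-filtering step before Vectors.sparse.
    counts.zipWithIndex.collect {
      case (c, i) if c != 0.0 => (i, c / bigrams.length)
    }.toSeq
  }
}
```

For example, BigramHashing.featurize("abab") sees the bigrams "ab", "ba",
"ab", so it should return two buckets with weights 2/3 and 1/3. Different
bigrams can collide in the same bucket; that is the trade-off hashing makes
to get a fixed-length vector without keeping a dictionary of bigrams.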

Liquan

On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta <soumya.sima...@gmail.com>
wrote:

> I'm trying to understand the intuition behind the features method that
> Aaron used in one of his demos. I believe this feature will just work for
> detecting the character set (i.e., language used).
>
> Can someone help ?
>
>
> def featurize(s: String): Vector = {
>   val n = 1000
>   val result = new Array[Double](n)
>   val bigrams = s.sliding(2).toArray
>
>   for (h <- bigrams.map(_.hashCode % n)) {
>     result(h) += 1.0 / bigrams.length
>   }
>
>   Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
> }
>
>
>
>


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst
