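The program computes hashed bigram frequencies: each two-character window of the string is hashed into one of n = 1000 buckets, every occurrence adds 1 / (total number of bigrams) to its bucket, and the zero-valued buckets are then filtered out so the result can be stored as a sparse vector. Hashing is an effective trick for vectorizing features. Take a look at http://en.wikipedia.org/wiki/Feature_hashing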
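To make the idea concrete, here is a minimal sketch of the same computation without the Spark dependency. The object name HashedBigrams, the function name hashedBigramFrequencies, and the (bucket, weight) return type are my own assumptions for illustration and are not part of Aaron's demo; I also use floorMod instead of % so the bucket index cannot go negative if the hashed tokens ever change.

```scala
object HashedBigrams {
  // Hypothetical helper (not from the original demo): returns (bucket, weight)
  // pairs instead of an MLlib Vector so it runs with plain Scala.
  def hashedBigramFrequencies(s: String, n: Int = 1000): Seq[(Int, Double)] = {
    // All overlapping two-character windows of the input.
    val bigrams = s.sliding(2).toArray
    val counts = new Array[Double](n)

    for (b <- bigrams) {
      // floorMod keeps the bucket in [0, n) for any hash code.
      val bucket = java.lang.Math.floorMod(b.hashCode, n)
      // Each occurrence contributes an equal share, so the weights sum to 1.
      counts(bucket) += 1.0 / bigrams.length
    }

    // Keep only the non-zero buckets, as (index, value) pairs.
    counts.zipWithIndex.collect { case (w, i) if w != 0.0 => (i, w) }.toSeq
  }
}
```

For example, "abab" has the bigrams "ab", "ba", "ab", so HashedBigrams.hashedBigramFrequencies("abab") yields two buckets with weights of roughly 2/3 and 1/3. Two different bigrams can land in the same bucket; that collision is the accepted trade-off for getting a fixed-length vector without building a vocabulary.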

Liquan

On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:

> I'm trying to understand the intuition behind the features method that
> Aaron used in one of his demos. I believe this feature will just work for
> detecting the character set (i.e., language used).
>
> Can someone help ?
>
>
> def featurize(s: String): Vector = {
>   val n = 1000
>   val result = new Array[Double](n)
>   val bigrams = s.sliding(2).toArray
>
>   for (h <- bigrams.map(_.hashCode % n)) {
>     result(h) += 1.0 / bigrams.length
>   }
>
>   Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
> }
>
>
>

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst