> On Aug. 30, 2014, 4:30 a.m., wang jian wrote:
> > datafu-pig/src/main/java/datafu/pig/hash/SimHash.java, line 78
> > <https://reviews.apache.org/r/25049/diff/1/?file=668674#file668674line78>
> >
> > could you please share the tutorial that describes the algorithm? Are
> > there any other SimHash algorithms we could also support?
http://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/
There are other variants which we can iterate on as SimHash v2.


> On Aug. 30, 2014, 4:30 a.m., wang jian wrote:
> > datafu-pig/src/main/java/datafu/pig/hash/SimHash.java, line 93
> > <https://reviews.apache.org/r/25049/diff/1/?file=668674#file668674line93>
> >
> > It seems that here only tri-grams are used instead of n-gram generated,
> > input parameter "n" is not used in this function? Should we use a sort of
> > sliding window to implement this?
> >
> > private List<String> computeNGramShingles(String line, int n) {
> >
> >     List<String> result = new ArrayList<String>(n);
> >
> >     String[] circularQueue = new String[n];
> >     StringTokenizer st = new StringTokenizer(line);
> >
> >     int index = 0;
> >     int circularQueueSize = 0;
> >
> >     StringBuffer strBuf = new StringBuffer();
> >
> >     while (st.hasMoreElements()) {
> >         String token = st.nextToken();
> >         if (circularQueueSize == n)
> >         {
> >             strBuf.setLength(0);
> >             for(int pn = 0; pn < n; pn++)
> >             {
> >                 if (pn > 0)
> >                 {
> >                     strBuf.append(" ");
> >                 }
> >                 strBuf.append(circularQueue[(index + pn) % n]);
> >             }
> >             result.add(strBuf.toString());
> >             index = (index + 1) % n;
> >             circularQueueSize--;
> >         }
> >         circularQueue[(index + circularQueueSize) % n] = token;
> >         if (circularQueueSize < n)
> >         {
> >             circularQueueSize++;
> >         }
> >     }
> >
> >     if (circularQueueSize == n)
> >     {
> >         strBuf.setLength(0);
> >         for(int pn = 0; pn < n; pn++)
> >         {
> >             if (pn > 0)
> >             {
> >                 strBuf.append(" ");
> >             }
> >             strBuf.append(circularQueue[(index + pn) % n]);
> >         }
> >         result.add(strBuf.toString());
> >     }
> >
> >     return result;
> > }
> >
> > The complete test class:
> > https://github.com/king821221/coding/blob/master/NGram.java

Added NGram function instead of 3-gram.


- Mohammad


-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/25049/#review51486
-----------------------------------------------------------


On Sept. 3, 2014, 8:55 p.m., Mohammad Amin wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25049/
> -----------------------------------------------------------
>
> (Updated Sept. 3, 2014, 8:55 p.m.)
>
>
> Review request for DataFu.
>
>
> Repository: datafu
>
>
> Description
> -------
>
> DATAFU-67. Adding Simple SimHash to compute near duplicates.
> https://issues.apache.org/jira/browse/DATAFU-67
>
>
> Diffs
> -----
>
>   datafu-pig/src/main/java/datafu/pig/hash/SimHash.java PRE-CREATION
>   datafu-pig/src/test/java/datafu/test/pig/hash/HashTests.java 7ff8fb9
>
> Diff: https://reviews.apache.org/r/25049/diff/
>
>
> Testing
> -------
>
> Unit tests passed.
>
>
> Thanks,
>
> Mohammad Amin
>
>
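
For reference, the sliding-window shingling discussed in the review above can also be written more compactly. The following is only a minimal sketch and not the code in the patch: the class name ShingleSketch and the helpers nGramShingles and simHash32 are hypothetical. The fingerprint step is a Charikar-style bit vote over String.hashCode(), chosen here purely for illustration; it is an assumption and may differ from what SimHash.java actually does, and a stronger hash function would likely be preferable in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the DataFu patch.
final class ShingleSketch {

    // Returns every run of n consecutive whitespace-separated tokens,
    // joined by single spaces. Lines with fewer than n tokens yield
    // an empty list.
    static List<String> nGramShingles(String line, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = line.trim();
        if (n <= 0 || trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        // Slide a window of width n across the token array.
        for (int start = 0; start + n <= tokens.length; start++) {
            result.add(String.join(" ",
                    Arrays.copyOfRange(tokens, start, start + n)));
        }
        return result;
    }

    // Assumed Charikar-style 32-bit fingerprint: each shingle votes +1 or -1
    // on every bit position according to its hash; the sign of each tally
    // becomes the corresponding output bit.
    static int simHash32(List<String> shingles) {
        int[] votes = new int[32];
        for (String shingle : shingles) {
            int h = shingle.hashCode(); // weak hash, for illustration only
            for (int bit = 0; bit < 32; bit++) {
                votes[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
            }
        }
        int fingerprint = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1 << bit);
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        List<String> shingles = nGramShingles("the quick brown fox jumps", 3);
        // prints [the quick brown, quick brown fox, brown fox jumps]
        System.out.println(shingles);
        System.out.println(Integer.toHexString(simHash32(shingles)));
    }
}
```

Under this scheme, near-duplicate lines share most of their shingles, so their fingerprints should differ in only a few bits; Integer.bitCount(a ^ b) gives the Hamming distance between two fingerprints a and b.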