----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/20608/#review41559 -----------------------------------------------------------
This is pretty cool, nicely done! I still need to read the paper to understand the theory behind it but the code looks good overall. datafu-pig/src/main/java/datafu/pig/hash/lsh/CosineDistanceHash.java <https://reviews.apache.org/r/20608/#comment75043> In practice do you think we probably don't want to provide a seed? It seems like if a seed is used every mapper instance will get the same RNG initialized with the same seed and therefore generate the same random numbers to initialize the hash functions. Should we make the seed optional? I can see it being useful for unit testing though. datafu-pig/src/main/java/datafu/pig/hash/lsh/CosineDistanceHash.java <https://reviews.apache.org/r/20608/#comment75033> Is it possible that a point could appear in more than one family by chance? Would we need to apply distinct to be safe? This question applies to the other examples as well. datafu-pig/src/main/java/datafu/pig/hash/lsh/L1PStableHash.java <https://reviews.apache.org/r/20608/#comment75042> doc: reword this sentence datafu-pig/src/main/java/datafu/pig/hash/lsh/L1PStableHash.java <https://reviews.apache.org/r/20608/#comment75044> typo: L1, not Cosine datafu-pig/src/main/java/datafu/pig/hash/lsh/L2PStableHash.java <https://reviews.apache.org/r/20608/#comment75045> typo: L1 => L2 datafu-pig/src/main/java/datafu/pig/hash/lsh/L2PStableHash.java <https://reviews.apache.org/r/20608/#comment75046> reword this too datafu-pig/src/main/java/datafu/pig/hash/lsh/L2PStableHash.java <https://reviews.apache.org/r/20608/#comment75047> typo: L2StableHash datafu-pig/src/main/java/datafu/pig/hash/lsh/L2PStableHash.java <https://reviews.apache.org/r/20608/#comment75048> typo: should be L2 distance datafu-pig/src/main/java/datafu/pig/hash/lsh/LSHFunc.java <https://reviews.apache.org/r/20608/#comment75032> would be good to validate the input schema datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/AbstractStableDistributionFunction.java <https://reviews.apache.org/r/20608/#comment75036> typo: with a => are datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/AbstractStableDistributionFunction.java <https://reviews.apache.org/r/20608/#comment75037> typo: of for datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/L2LSH.java <https://reviews.apache.org/r/20608/#comment75038> typo: L2 and 2-stable datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/L2LSH.java <https://reviews.apache.org/r/20608/#comment75040> typo: 2-stable datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/L2LSH.java <https://reviews.apache.org/r/20608/#comment75041> typo: gaussian datafu-pig/src/test/java/datafu/test/pig/hash/lsh/LSHPigTest.java <https://reviews.apache.org/r/20608/#comment75049> some of the indentation could be cleaned up here - Matthew Hayes On April 23, 2014, 1:14 p.m., Casey Stella wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/20608/ > ----------------------------------------------------------- > > (Updated April 23, 2014, 1:14 p.m.) > > > Review request for DataFu. > > > Repository: datafu > > > Description > ------- > > From DATAFU-37: Create a set of UDFs to implement Locality Sensitive Hashing > in support of finding k-near neighbors. Initially, hashes associated with L1, > L2 and Cosine similarity should be supported. > > > Diffs > ----- > > datafu-pig/src/main/java/datafu/pig/hash/lsh/CosineDistanceHash.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/L1PStableHash.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/L2PStableHash.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/LSHFamily.java PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/LSHFunc.java PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/RepeatingLSH.java PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/cosine/HyperplaneLSH.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/cosine/package-info.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/interfaces/LSH.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/interfaces/LSHCreator.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/interfaces/Sampler.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/interfaces/package-info.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/metric/Cosine.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/metric/L1.java PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/metric/L2.java PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/metric/MetricUDF.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/metric/package-info.java > PRE-CREATION > > datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/AbstractStableDistributionFunction.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/L1LSH.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/L2LSH.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/p_stable/package-info.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/package-info.java PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/util/DataTypeUtil.java > PRE-CREATION > datafu-pig/src/main/java/datafu/pig/hash/lsh/util/package-info.java > PRE-CREATION > datafu-pig/src/test/java/datafu/test/pig/hash/lsh/LSHPigTest.java > PRE-CREATION > datafu-pig/src/test/java/datafu/test/pig/hash/lsh/LSHTest.java PRE-CREATION > > Diff: https://reviews.apache.org/r/20608/diff/ > > > Testing > ------- > > 2 unit tests. One pigunit for the UDFs and one regular JUnit test to test > functionality. > > > Thanks, > > Casey Stella > >