Hello again,

First of all, thank you all for taking the time to answer my ideas. Based on your thoughts, I have been digging around, and the project I would very much like to propose is a MapReduce implementation of the simhash algorithm. Thank you, Ted, for the great idea! I am keen on implementing a new algorithm for Mahout, and this one seems manageable and doable during the summer period, with a clear scope. I have noticed quite a few comments advising against taking on a project that would be too big, and I intend to follow them.
After taking a deep look at the article pointed out by Ted, as well as other articles and similar ideas (Charikar's algorithm, Ankur's patch for MinHash), I have decided to write up some notes on the algorithm and a schedule for implementing the idea during the competition. I kindly ask for your feedback on this proposal.

1. Problem and outline of simhash
---------------------------------
Detecting similar files and classifying documents requires complex heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm [1] relies on computing a hash function that hashes similar files to similar values. File similarities are then determined by comparing the pre-sorted hash key values, reducing the complexity to O(n log n). To further improve the detection mechanism, the algorithm also stores some auxiliary data that was used to compute the hash keys. This serves as a second filter, i.e. it is applied after the hash key comparison indicates that two files are potentially similar.

Properties of the similarity function and the algorithm:
- very similar files map to very similar or even the same hash key;
- the distance between keys should be some measure of the difference between files. On its own this would lead to keys proportional to file sizes and would create false positives; the auxiliary data provides an easy and efficient way of refining the similarity detection;
- the metric used is a binary metric (simhash operates at the byte level);
- given a similarity metric, there needs to be a threshold to determine how close within the metric files need to be in order to count as similar.

I would like my implementation of the simhash algorithm to answer two questions:
- retrieve the set of files similar to a given file;
- retrieve all pairs of similar files.

2. simhash implementation
-------------------------
Choose 16 8-bit tags - the bytes used for pattern matching. The idea is to count the occurrences of these tags within the file under processing.

2.1 Computing the hash function for each file
---------------------------------------------
for each directory
    for each file within the directory
        - scan through the file and look for matches for each tag;
        - the detection window is moved one bit at a time;
        - if a match is found, set a skip_counter and then decrement it for the following bits;
        - a count of matches is kept for each tag, and these counts are stored as the SUM_TABLE_ENTRIES;
        - the KEY for the current file is computed as a function of the SUM_TABLE_ENTRIES (a linear combination of the sums) [#1];
        - after this processing, we store:
          | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |

The key function could also be implemented to take file extensions into account.

2.2 Finding similarities
------------------------
for each FILE
    SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
    for each SIMILAR_FILE
        if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
            discard the candidate
        else
            compute the distance between SIMILAR_FILE's and FILE's SUM_TABLE_ENTRIES
            (i.e. the sum of the absolute values of the differences between their entries);
            if the computed distance is 0, the files are considered identical;
            if it is within a tolerance, the files are considered similar.

3. Distributing simhash on Hadoop
---------------------------------
The algorithm above is very well suited for a MapReduce implementation (a rough code sketch follows the two phases below).

1. Map
In this phase we compute the simhash for each file. The mapper outputs (file, simhash_key).

2. Reduce
Once every item has a simhash, group every file with the same simhash key into a cluster, sorted by the simhash key (the key is an integer [#1]).
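To make the data flow concrete, here is a rough sketch of what the mapper and reducer could look like with Hadoop's Java MapReduce API. This is only an illustration under assumptions of mine, not a final design: the class names (SimhashSketch, SimhashMapper, SimhashReducer), the tag values, the key weights, and the thresholds (SKIP_BITS, SUM_TOLERANCE, SIZE_RATIO) are placeholders made up for the sketch, the job driver is omitted, and the input is assumed to be a plain text file listing one HDFS path per line.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SimhashSketch {

  // 16 8-bit tags used for pattern matching; these values are placeholders.
  private static final int[] TAGS = {
      0x0F, 0x1E, 0x2D, 0x3C, 0x4B, 0x5A, 0x69, 0x78,
      0x87, 0x96, 0xA5, 0xB4, 0xC3, 0xD2, 0xE1, 0xF0};
  private static final int SKIP_BITS = 8;       // assumed skip after a match
  private static final int SUM_TOLERANCE = 16;  // assumed distance threshold
  private static final double SIZE_RATIO = 2.0; // assumed "differs a lot" cutoff

  // Map: read one HDFS path per input line, compute the file's sum table and
  // simhash key, and emit (key, "path|size|sum_table").
  public static class SimhashMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      Path file = new Path(line.toString().trim());
      FileSystem fs = file.getFileSystem(ctx.getConfiguration());
      long size = fs.getFileStatus(file).getLen();
      int[] sums = new int[TAGS.length];
      FSDataInputStream in = fs.open(file);
      try {
        int window = 0, skip = 0, b;
        long bitsSeen = 0;
        while ((b = in.read()) != -1) {
          for (int i = 7; i >= 0; i--) {   // slide the 8-bit window one bit at a time
            window = ((window << 1) | ((b >> i) & 1)) & 0xFF;
            bitsSeen++;
            if (skip > 0) { skip--; continue; }  // still inside a previous match
            if (bitsSeen < 8) continue;          // window not yet full
            for (int t = 0; t < TAGS.length; t++) {
              if (window == TAGS[t]) { sums[t]++; skip = SKIP_BITS; break; }
            }
          }
        }
      } finally {
        in.close();
      }
      // KEY = linear combination of the SUM_TABLE_ENTRIES; the weights (t + 1)
      // are again only an assumption for this sketch.
      int key = 0;
      StringBuilder sumStr = new StringBuilder();
      for (int t = 0; t < sums.length; t++) {
        key += (t + 1) * sums[t];
        sumStr.append(t == 0 ? "" : ",").append(sums[t]);
      }
      ctx.write(new IntWritable(key), new Text(file + "|" + size + "|" + sumStr));
    }
  }

  // Reduce: all files with the same key arrive together; apply the file-size
  // and sum-table filters from section 2.2 pair-wise and emit similar pairs.
  public static class SimhashReducer extends Reducer<IntWritable, Text, Text, Text> {
    protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String[]> files = new ArrayList<String[]>();
      for (Text v : values) {
        files.add(v.toString().split("\\|"));   // [path, size, sum_table]
      }
      for (int i = 0; i < files.size(); i++) {
        for (int j = i + 1; j < files.size(); j++) {
          long s1 = Long.parseLong(files.get(i)[1]);
          long s2 = Long.parseLong(files.get(j)[1]);
          if (Math.max(s1, s2) > SIZE_RATIO * Math.min(s1, s2)) {
            continue;                           // sizes differ too much
          }
          String[] a = files.get(i)[2].split(",");
          String[] c = files.get(j)[2].split(",");
          int dist = 0;
          for (int t = 0; t < a.length; t++) {
            dist += Math.abs(Integer.parseInt(a[t]) - Integer.parseInt(c[t]));
          }
          if (dist <= SUM_TOLERANCE) {
            ctx.write(new Text(files.get(i)[0]), new Text(files.get(j)[0]));
          }
        }
      }
    }
  }
}

Note that this reducer groups files by the exact integer key, which is stricter than the "tolerance range" lookup in section 2.2; emitting several copies of each record keyed on parts of the hash, as Ted suggested, would be one way to relax this.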
QUESTIONS SO FAR:
1. For the distributed version, I believe there should be a preprocessing phase where data about the files is stored on HDFS in a proper vector format. Would it be better to store this in HBase, Cassandra or something similar?
2. Would it be better in the reduce phase to actually eliminate the duplicates and create a single sorted-by-hash file?

4. Implementation schedule
--------------------------
*** May 24 - Jul 12:
* Week 1: write documentation for the algorithm on the wiki and start planning how the objects would interact (classes, methods, etc.).
* Weeks 2, 3, 4: write the algorithm for hashing the files.
* Week 5: thoroughly test the hashing code.
* Week 6: document the MapReduce implementation (2 days); organize the data structures (3 days).
* Weeks 7, 8, 9: write the distributed MapReduce implementation of the algorithm along with tests. Test as much as possible on different datasets.
*** Jul 16 - Aug 9:
* Weeks 10, 11, 12, 13: continue with tests on different datasets for the distributed implementation. If time permits, I would also like to make graphs and publish the results of the tests carried out.

I'm very open to suggestions regarding this plan, so any observation is highly appreciated. Thank you again for taking the time to read this.

Regards,
Cristi.

[1] Caitlin Sadowski, Greg Levin. SimHash: Hash-based Similarity Detection. December 13, 2007.

On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:

> Minhash clustering is important for duplicate detection. You can also
> do simhash clustering
> <http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>
> which might be simpler to implement. I can imagine an implementation where
> the map generates the simhash and emits multiple copies keyed on part of the
> hash. Near duplicates would thus be grouped together in the reduce where an
> in-memory duplicate detection could be done.
>
> Sean's intuitions about project size are generally better than mine, but if
> you are strict about not increasing scope, I think you could
> succeed with this project.
>
> On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan
> <prodan.crist...@gmail.com> wrote:
>
>> IDEA 1 - MinHash clustering
>> ---------------------------
>> The first idea come after taking a look at Google 's paper on collaborative
>> filtering for their news system[2]. In that paper, I looked at MinHash
>> clustering.
>> My first question is: is MinHash clustering considered cool ? If yes, than
>> I would like to take a stab at implementing it.
>> The paper also describes the implementation in a MapReduce style. Since
>> this is only a suggestion I will not elaborate very much on the solution
>> now. I would like to ask you weather this might be considered a good choice
>> (i.e. important for the framework to have something like this) and if this
>> is a big enough project.
>>