[ 
https://issues.apache.org/jira/browse/MAHOUT-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667991#comment-13667991
 ] 

Suneel Marthi commented on MAHOUT-365:
--------------------------------------

I recently had to implement SimHash clustering at work (for determining 
document similarity and generating unique fingerprints for documents). Would 
it be useful to reconsider having SimHash clustering in Mahout? 
                
> [GSoC] Proposal to implement SimHash clustering on MapReduce
> ------------------------------------------------------------
>
>                 Key: MAHOUT-365
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-365
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Cristi Prodan
>              Labels: gsoc, gsoc2010, proposal
>
> Application for Google Summer of Code 2010 - Mahout Project
> Student: Cristian Prodan
> 1. Synopsis
> I will add a map-reduce implementation of the SimHash clustering algorithm to 
> the Mahout project. This algorithm provides an efficient way of finding 
> similar/identical files in a large collection of files. 
> 2. Project
> Storage capacities keep growing, and it is becoming difficult to organize 
> and manage growing file systems. It is easy to lose track of identical 
> copies or older versions of files in a directory structure, and duplicate 
> detection technologies do not extend well to files that are not identical. A 
> typical idea for detecting similar files is to map the features of a file 
> into a high-dimensional space and then use distance within that space as a 
> measure of similarity. This may not be feasible, since it implies an O(n^2) 
> comparison cost. If these file-to-vector mappings are reduced to a 
> one-dimensional space, the data points can instead be sorted in O(n log n) 
> time - a big increase in detection speed. 
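The sorted-scan idea above can be illustrated with a small sketch (hypothetical integer keys and a made-up gap threshold, just to show why one-dimensional keys make candidate detection an O(n log n) sort plus a linear scan):

```java
import java.util.Arrays;

/** Sketch of the sorted-scan idea: once every file is reduced to a
 *  one-dimensional key, sorting costs O(n log n) and near-duplicate
 *  candidates end up adjacent in the sorted order. */
public class SortedScan {

    /** Counts adjacent pairs in sorted key order whose gap is at most maxGap. */
    static int countAdjacentCandidates(long[] keys, long maxGap) {
        long[] sorted = keys.clone();
        Arrays.sort(sorted);
        int candidates = 0;
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] - sorted[i - 1] <= maxGap) {
                candidates++;
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Hypothetical keys: 100 and 103 are "close", the rest are far apart.
        long[] keys = {100L, 9000L, 103L, 5000L};
        System.out.println(countAdjacentCandidates(keys, 10L)); // 1
    }
}
```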
> I will implement the SimHash algorithm presented in detail in [1]. The idea 
> is the following: using a hash function that hashes similar files to similar 
> values, file similarity can be determined simply by comparing pre-sorted 
> hash key values. 
> I will implement a family of similarity hash functions which will do this, as 
> described in [1]. Furthermore, the performance will be enhanced by storing 
> auxiliary data used to compute the hash keys. This data will be used as a 
> second filter after a hash key comparison indicates that two files are 
> potentially similar. 
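As a rough illustration of the hashing idea, here is a minimal 64-bit simhash sketch. It is a hypothetical construction using byte 8-grams as features and FNV-1a as the per-feature hash, not the family of functions from [1]: each feature votes on each bit position, and the fingerprint keeps the majority bit.

```java
/** A minimal 64-bit simhash sketch over byte n-grams (hypothetical
 *  illustration, not the implementation proposed for Mahout). */
public class SimHashSketch {

    /** Simple 64-bit FNV-1a hash of a byte window. */
    static long fnv1a64(byte[] data, int off, int len) {
        long h = 0xcbf29ce484222325L;
        for (int i = off; i < off + len; i++) {
            h ^= (data[i] & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Folds per-feature hashes into one fingerprint: bit i of the result
     *  is 1 iff more features had bit i set than unset. */
    static long simhash(byte[] content, int window) {
        int[] counts = new int[64];
        for (int off = 0; off + window <= content.length; off++) {
            long h = fnv1a64(content, off, window);
            for (int bit = 0; bit < 64; bit++) {
                counts[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (counts[bit] > 0) fingerprint |= 1L << bit;
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        byte[] a = "the quick brown fox jumps over the lazy dog".getBytes();
        byte[] b = "the quick brown fox jumped over the lazy dog".getBytes();
        byte[] c = "completely unrelated text with nothing shared".getBytes();
        long ha = simhash(a, 8), hb = simhash(b, 8), hc = simhash(c, 8);
        // Similar inputs should differ in fewer fingerprint bits than
        // dissimilar ones.
        System.out.println(Long.bitCount(ha ^ hb) + " " + Long.bitCount(ha ^ hc));
    }
}
```

Because most 8-grams of the two similar sentences coincide, their majority votes mostly agree, which is what makes the fingerprints comparable.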
> Properties for the similarity function and the algorithm:
> - very similar files map to very similar or even the same hash key;
> - the distance between keys should be some measure of the difference between 
> the files. On its own this would lead to keys proportional to file sizes, 
> which would create false positives. The auxiliary data mentioned above will 
> provide an easy and efficient way of refining the similarity detection;  
> - the metric used will be a binary metric (simhash operates at byte level). 
> - given a similarity metric, there needs to be a threshold to determine how 
> close within the metric files need to be to count as similar. 
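The threshold property from the list above could look like this in code (a hypothetical Hamming-distance check on fingerprint bits; the actual metric and threshold are design decisions for the project itself):

```java
/** Hypothetical threshold check: two simhash fingerprints count as
 *  "similar" when their Hamming distance is at most maxBits. */
public class SimilarityThreshold {

    /** Number of differing bits between two 64-bit fingerprints. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    static boolean similar(long keyA, long keyB, int maxBits) {
        return hammingDistance(keyA, keyB) <= maxBits;
    }

    public static void main(String[] args) {
        System.out.println(similar(0b1011L, 0b1001L, 3)); // distance 1 -> true
    }
}
```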
> From a distributed point of view, the algorithm above is very suited for a 
> MapReduce implementation. The sketch is the following:
> I. MAP phase
> In this phase we compute the hash for a file along with additional info which 
> serves as a second filter in similarity detection. 
> It outputs (File, simhash_key).
> II. REDUCE phase
> Once every file has a simhash, group the files sharing the same simhash into 
> a cluster, sorted by the simhash key (the key is an integer). 
> * The similarity check can be done in main memory. 
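The two phases above can be sketched in plain Java (a hypothetical in-memory stand-in for the Hadoop job, with made-up file names and precomputed keys standing in for real simhash values):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory sketch of the MAP/REDUCE phases: MAP emits (file, simhash_key);
 *  REDUCE groups files sharing a key into clusters sorted by key. */
public class SimHashPhases {

    static SortedMap<Long, List<String>> cluster(Map<String, Long> fileToKey) {
        // TreeMap keeps clusters sorted by the integer simhash key.
        SortedMap<Long, List<String>> clusters = new TreeMap<>();
        for (Map.Entry<String, Long> e : fileToKey.entrySet()) {
            clusters.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                    .add(e.getKey());
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, Long> keys = new LinkedHashMap<>();
        keys.put("a.txt", 42L);
        keys.put("b.txt", 42L);
        keys.put("c.txt", 7L);
        System.out.println(cluster(keys)); // {7=[c.txt], 42=[a.txt, b.txt]}
    }
}
```

In the real job, the grouping and sorting is exactly what Hadoop's shuffle phase provides for free between the mapper and the reducer.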
> 3. Deliverables:
> 1. Batch tool for sim-hashing the files. The hashes will be stored either on 
> a normal file system (for small tests), on HDFS, or in HBase.  
> 2. A tool for answering the following types of queries:
> - retrieving a set of files, similar to a given file;
> - retrieving all pairs of similar files.
> There will be the option to run this as a standalone program, for tests or 
> smaller scale purposes.
> 3. Documentation and unit tests for the written code. 
> 4. Getting started tutorials for SimHash.
> 5. Demos for SimHash applied to 20newsgroups and Wikipedia data sets.
> 4. Benefits for the Mahout community:
> 1) A distributed tool for efficient similarity detection of files in large 
> datasets. Algorithms for detecting similar files can also be useful for 
> classification purposes and as an aid to search. Next release of Mahout will 
> contain this functionality. 
> 2) This can also be used at smaller scale, by running it in a 
> "non-distributed" mode on small datasets.
> 3) Good tutorials on how to work with this tool. 
> 5. Roadmap
> 1).  Community Bonding Period (April 21st to May 24th)
> - 1 week: Familiarize with Mahout, hadoop, and the existing clustering 
> algorithms;
> - 2 weeks: Understand the data structures (Vector, ClusterBase, Writable and 
> other interfaces and classes used in Mahout) and start initial planning of 
> the system in terms of interactions and data structures. 
> - 1 week: Setup a hadoop cluster formed of 2-3 nodes;
> - During this time, I plan to speak with my mentor and ask him very specific 
> questions about the plans I make for the implementation. 
> 2).  May 24th to Jul 12th
> - Week 1: Detailed planning of how the objects will interact (classes, 
> methods, the Vector interfaces I plan to use, etc.) and ask for feedback 
> from the mentor. 
> - Week 2: Write the code for hashing files (a batch process which will work 
> as an indexer), according to the algorithm, in a TDD style. At the end tests 
> will accompany the code. The results will be stored on normal file system or 
> HDFS.
> - Week 3: Half week: Test the program in "single" (non-distributed) mode. 
> Start interacting with HBase if time permits. 
> - Week 4: Interact with HBase; The algorithm will store the hashes in HBase 
> for fast retrieval. 
> - Week 5: Test the algorithm on different datasets: local file system, 
> 20newsgroups and wikipedia.
> - Week 6, 7: Add the distributed MapReduce implementation (based on hadoop) 
> and write unit tests for it;
> - Week 8: Continue with unit tests and test the distributed implementation on 
> the data sets for local file system, HDFS and HBase. Fix outstanding bugs. 
> Analyze results. 
> - Week 9: Continue with testing and bug fixing and make sure everything is 
> ready for mid-term evaluation;
> Mid-term evaluation period.
> 3). Jul 16th - Aug 9th
> - Week 10: Extend the batch tool with different options (different 
> hash-function families, threshold for similarity). Experiment with different 
> hash-function families; give the user an option to specify hash functions. 
> - Week 11: Finish off the demos for the 20newsgroups and wikipedia datasets. 
> Write some tests for them. Publish the results from the demos on the wiki.
> - Week 12: Write the getting started guides on how to use the SimHash tool;
> - Week 13: Finish off everything, fix outstanding bugs, clean and document 
> undocumented code.  
> 6. Biography 
> My name is Cristi Prodan, I am 23 years old and currently a 2nd-year student 
> pursuing an MSc degree in Computer Science. During the past year, I have 
> been interested in the study of machine learning, mainly Recommender Systems 
> and clustering techniques. I had heard about Hadoop and Mahout and found the 
> idea of working with them challenging, should I ever have the chance. My 
> dissertation is on Recommender Systems, and I will use Mahout at its core. 
> Since I heard that The Apache Software Foundation is willing to participate 
> in this year's edition of Google Summer of Code, I was eager to try my hand 
> at contributing to the Mahout project. Having been a subscriber to the list 
> since December 2009, I started to interact with the community and present my 
> ideas. I was mostly interested in the MinHash algorithm [2] (which may also 
> be used for similarity detection), whose implementation was started by a 
> contributor. After analyzing his code and seeking advice from the other 
> members, I submitted my first patch to Mahout (on JIRA, MAHOUT-344). During 
> this time I have also skimmed through the wiki and [3].
> At university, I have worked extensively with Java. I am also a big fan of 
> Ruby and Python, and I have recently started digging into Erlang. In the 
> past two years, in my free time, I have done some freelancing, working on 
> various web applications. 
> 7. References
> [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection, 
> December 13, 2007.
> [2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
> Personalization: Scalable Online Collaborative Filtering, WWW 2007.
> [3] Owen, Anil - Mahout in Action. Manning, 2010 (MEAP). 
> ------------------------------------------------------
> This proposal is partly inspired by that of Konstantin Kafer 
> [http://drupal.org/files/application.pdf]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
