[
https://issues.apache.org/jira/browse/MAHOUT-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667991#comment-13667991
]
Suneel Marthi commented on MAHOUT-365:
--------------------------------------
I recently had to implement SimHash clustering at work (to determine document
similarity and to generate unique fingerprints for documents). Would it be
useful to reconsider having SimHash clustering in Mahout?
> [GSoC] Proposal to implement SimHash clustering on MapReduce
> ------------------------------------------------------------
>
> Key: MAHOUT-365
> URL: https://issues.apache.org/jira/browse/MAHOUT-365
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Cristi Prodan
> Labels: gsoc, gsoc2010, proposal
>
> Application for Google Summer of Code 2010 - Mahout Project
> Student: Cristian Prodan
> 1. Synopsis
> I will add a map-reduce implementation of the SimHash clustering algorithm to
> the Mahout project. This algorithm provides an efficient way of finding
> similar/identical files in a large collection of files.
> 2. Project
> Storage capacities keep growing, and it becomes difficult to organize and
> manage growing file systems. It is easy to lose track of identical copies or
> older versions of files in a directory structure. Duplicate-detection
> technologies do not extend well to files that are similar but not identical.
> A typical idea for detecting similar files is to map the features of a file
> into a high-dimensional space and then use distance within that space as a
> measure of similarity. This may not be feasible, since comparing all pairs
> gives the algorithm O(n^2) complexity. If these file-to-vector mappings are
> reduced to a one-dimensional space, the data points can be sorted in
> O(n log n) time - a big increase in detection speed.
> I will implement the SimHash algorithm presented in detail in [1]. The idea
> is the following: using a hash function that hashes similar files to similar
> values, file similarity can be determined simply by comparing pre-sorted
> hash key values.
> I will implement a family of similarity hash functions that do this, as
> described in [1]. Furthermore, performance will be enhanced by storing
> auxiliary data used to compute the hash keys. This data will be used as a
> second filter after a hash key comparison indicates that two files are
> potentially similar.
> Properties of the similarity function and the algorithm:
> - very similar files map to very similar or even the same hash key;
> - the distance between keys should be some measure of the difference between
> the files. This would lead to keys proportional to file sizes, which would
> create false positives; the auxiliary data mentioned above will provide an
> easy and efficient way of refining the similarity detection.
> - the metric used will be a binary metric (simhash operates at the byte
> level).
> - given a similarity metric, there needs to be a threshold to determine how
> close within the metric two files need to be to count as similar (a sketch
> of the key computation and threshold check follows below).
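> As a rough illustration (not taken from [1]), the sketch below shows one
> common simhash variant - Charikar-style bit voting over byte-level shingles -
> together with the Hamming-distance check used as the binary metric. The
> shingle size, the CRC32 shingle hash and the threshold are arbitrary choices
> for the example, not the hash family that will actually be implemented.
>
>   import java.util.zip.CRC32;
>
>   public class SimHashSketch {
>
>     /** Computes a 32-bit simhash key from the raw bytes of a file. */
>     public static int simhash(byte[] content, int shingleSize) {
>       int[] votes = new int[32];
>       for (int i = 0; i + shingleSize <= content.length; i++) {
>         CRC32 crc = new CRC32();   // any uniform hash of a shingle works for the example
>         crc.update(content, i, shingleSize);
>         int h = (int) crc.getValue();
>         for (int bit = 0; bit < 32; bit++) {
>           votes[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
>         }
>       }
>       int key = 0;
>       for (int bit = 0; bit < 32; bit++) {
>         if (votes[bit] > 0) {
>           key |= 1 << bit;         // majority vote decides each bit of the key
>         }
>       }
>       return key;
>     }
>
>     /** Binary metric: Hamming distance between two simhash keys. */
>     public static int distance(int a, int b) {
>       return Integer.bitCount(a ^ b);
>     }
>
>     public static void main(String[] args) {
>       byte[] a = "the quick brown fox jumps over the lazy dog".getBytes();
>       byte[] b = "the quick brown fox jumped over the lazy dog".getBytes();
>       int ka = simhash(a, 8);
>       int kb = simhash(b, 8);
>       // Two files count as similar when their keys are within a chosen threshold.
>       System.out.println("distance = " + distance(ka, kb)
>           + ", similar = " + (distance(ka, kb) <= 3));
>     }
>   }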
> From a distributed point of view, the algorithm above is well suited to a
> MapReduce implementation. The sketch is the following:
> I. MAP phase
> In this phase we compute the hash for a file, along with the additional
> information that serves as a second filter in similarity detection.
> It outputs (File, simhash_key) pairs.
> II. REDUCE phase
> Once every file has a simhash, every file with the same simhash is grouped
> into a cluster, sorted by the simhash key (the key is an integer).
> * The similarity check can be done in main memory.
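> As a minimal sketch of this map/reduce shape (using the Hadoop MapReduce
> API), the mapper below keys its output by the simhash so that the shuffle
> sorts the keys and groups files with equal keys for the reducer. It assumes
> an input format that hands the mapper (file name, file contents) pairs and
> reuses the illustrative simhash() helper from the sketch above; neither is
> part of the proposal itself.
>
>   import java.io.IOException;
>   import java.util.Arrays;
>
>   import org.apache.hadoop.io.BytesWritable;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   public class SimHashJob {
>
>     /** MAP: (file name, file contents) -> (simhash key, file name). */
>     public static class SimHashMapper
>         extends Mapper<Text, BytesWritable, IntWritable, Text> {
>       @Override
>       protected void map(Text fileName, BytesWritable contents, Context context)
>           throws IOException, InterruptedException {
>         byte[] bytes = Arrays.copyOf(contents.getBytes(), contents.getLength());
>         int key = SimHashSketch.simhash(bytes, 8);   // helper from the sketch above
>         context.write(new IntWritable(key), fileName);
>       }
>     }
>
>     /** REDUCE: files sharing a simhash key arrive together and form one cluster. */
>     public static class ClusterReducer
>         extends Reducer<IntWritable, Text, IntWritable, Text> {
>       @Override
>       protected void reduce(IntWritable key, Iterable<Text> files, Context context)
>           throws IOException, InterruptedException {
>         StringBuilder cluster = new StringBuilder();
>         for (Text file : files) {
>           cluster.append(file.toString()).append(' ');
>         }
>         context.write(key, new Text(cluster.toString().trim()));
>       }
>     }
>   }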
> 3. Deliverables:
> 1. A batch tool for sim-hashing the files. The hashes will be stored on a
> normal file system (for small tests), on HDFS, or in HBase.
> 2. A tool for answering the following types of queries (see the sketch after
> this list):
> - retrieving a set of files similar to a given file;
> - retrieving all pairs of similar files.
> There will be the option to run this as a standalone program, for tests or
> smaller-scale purposes.
> 3. Documentation and unit tests for the written code.
> 4. Getting started tutorials for SimHash.
> 5. Demos for SimHash applied to 20newsgroups and Wikipedia data sets.
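> For illustration only, the "all pairs of similar files" query from
> deliverable 2 could be answered over the pre-sorted keys roughly as follows.
> The window size and Hamming threshold are assumptions made for the example,
> and the auxiliary-data filter described earlier would be applied to the
> resulting candidate pairs.
>
>   import java.util.ArrayList;
>   import java.util.Comparator;
>   import java.util.List;
>
>   public class SimilarPairsQuery {
>
>     /** One (simhash key, file name) entry, as produced by the batch hashing step. */
>     public static final class Entry {
>       final int key;
>       final String file;
>       Entry(int key, String file) { this.key = key; this.file = file; }
>     }
>
>     /**
>      * Sorts the entries by key (O(n log n)) and compares each entry only with
>      * its near neighbours in sorted order, keeping pairs within the threshold.
>      */
>     public static List<String[]> similarPairs(List<Entry> entries, int window, int threshold) {
>       List<Entry> sorted = new ArrayList<>(entries);
>       sorted.sort(Comparator.comparingInt((Entry e) -> e.key));
>       List<String[]> pairs = new ArrayList<>();
>       for (int i = 0; i < sorted.size(); i++) {
>         for (int j = i + 1; j < sorted.size() && j <= i + window; j++) {
>           int distance = Integer.bitCount(sorted.get(i).key ^ sorted.get(j).key);
>           if (distance <= threshold) {
>             pairs.add(new String[] {sorted.get(i).file, sorted.get(j).file});
>           }
>         }
>       }
>       return pairs;
>     }
>   }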
> 4. Benefits for the Mahout community:
> 1) A distributed tool for efficient similarity detection of files in large
> datasets. Algorithms for detecting similar files can also be useful for
> classification purposes and as an aid to search. The next release of Mahout
> will contain this functionality.
> 2) The tool can also be used at smaller scale, by running it in an
> un-distributed mode on small datasets.
> 3) Good tutorials on how to work with this tool.
> 5. Roadmap
> 1). Community Bonding Period (April 21st to May 24th)
> - 1 week: Familiarize myself with Mahout, Hadoop, and the existing clustering
> algorithms;
> - 2 weeks: Understand the data structures (Vector, ClusterBase, Writable and
> other interfaces and classes used in Mahout) and start initial planning of
> the system in terms of interactions and data structures.
> - 1 week: Set up a Hadoop cluster of 2-3 nodes;
> - During this time, I plan to speak with my mentor and ask him very specific
> questions about the plans I make for the implementation.
> 2). May 24th to Jul 12th
> - Week 1: Detailed planning on how the objects would interact (classes,
> methods, the Vector interfaces I plan to use etc.) and ask feedback from the
> mentor.
> - Week 2: Write the code for hashing files (a batch process which will work
> as an indexer), according to the algorithm, in a TDD style. At the end tests
> will accompany the code. The results will be stored on normal file system or
> HDFS.
> - Week 3: Half week: Test the program in "single" (un-distributed) mode.
> Start interacting with HBase if time permits.
> - Week 4: Interact with HBase; The algorithm will store the hashes in HBase
> for fast retrieval.
> - Week 5: Test the algorithm on different datasets: local file system,
> 20newsgroups and wikipedia.
> - Week 6, 7: Add the distributed MapReduce implementation (based on Hadoop)
> and write unit tests for it;
> - Week 8: Continue with unit tests and test the distributed implementation on
> the data sets for local file system, HDFS and HBase. Fix outstanding bugs.
> Analyze results.
> - Week 9: Continue with testing and bug fixing and make sure everything is
> ready for mid-term evaluation;
> Mid-term evaluation period.
> 3). Jul 16th - Aug 9th
> - Week 10: Extend the batch tool with different options (different hash
> function families, similarity threshold). Experiment with different hash
> function families; give the user an option to specify hash functions.
> - Week 11: Finish off the demos for the 20newsgroups and wikipedia datasets.
> Write some tests for them. Publish the results from the demos on the wiki.
> - Week 12: Write the getting started guides on how to use the SimHash tool;
> - Week 13: Finish off everything, fix outstanding bugs, clean and document
> undocumented code.
> 6. Biography
> My name is Cristi Prodan, I am 23 years old and currently a 2nd-year student
> pursuing an MSc degree in Computer Science. During the past year, I have been
> interested in the study of machine learning, mainly recommender systems and
> clustering techniques. I had heard about Hadoop and Mahout and found the idea
> of working with them appealing, should I ever get the chance. My dissertation
> paper is on recommender systems, and I will use Mahout at its core.
> Since I heard that The Apache Software Foundation is willing to participate
> in this year's edition of Google Summer of Code, I was eager to try my hand
> at contributing to the Mahout project. Having subscribed to the list in
> December 2009, I started to interact with the community and present my
> ideas. I was mostly interested in the MinHash algorithm [2] (which may also
> be used for similarity detection), whose implementation was started by a
> contributor. After analyzing his code and seeking advice from the other
> members, I submitted my first patch to Mahout (on JIRA, MAHOUT-344).
> During this time I have also skimmed through the wiki and [3].
> At university, I have worked extensively with Java. I am also a big fan of
> Ruby and Python, and I have recently started digging into Erlang. In the past
> two years, in my free time, I have done some freelance work on various web
> applications.
> 7. References
> [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection,
> December 13, 2007.
> [2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News
> Personalization: Scalable Online Collaborative Filtering, WWW 2007.
> [3] Sean Owen, Robin Anil - Mahout in Action. Manning (MEAP), 2010.
> ------------------------------------------------------
> This proposal is partly inspired by that of Konstantin Kafer
> [http://drupal.org/files/application.pdf]