Hello again,

First of all, thank you all for taking the time to respond to my ideas. Based 
on your thoughts, I have been digging around, and the project I would very 
much like to propose is a MapReduce implementation of the simhash algorithm. 
Thank you Ted for the great idea! I'm keen on implementing a new algorithm for 
Mahout, and this one seems manageable within the summer period and has a clear 
scope. I have noticed quite a few comments advising against taking on a 
project that would be too big, and I intend to follow that advice. 

After taking a close look at the article pointed out by Ted, as well as other 
articles and similar ideas (Charikar's algorithm, Ankur's patch for MinHash), 
I've written up some notes on the algorithm and a schedule for implementing 
the idea during the summer. I kindly ask for your feedback on this proposal. 


1. Problem and outline of simhash
---------------------------------------

Detecting similar files and classifying documents normally requires complex 
heuristics and/or O(n^2) pair-wise comparisons [1]. The simhash algorithm [1] 
relies on a hash function that maps similar files to similar hash values. File 
similarity is then determined by comparing the pre-sorted hash key values, 
which reduces the complexity to O(n log n). 
To further improve the detection mechanism, the algorithm also stores some 
auxiliary data that was used to compute the hash keys. This data serves as a 
second filter, i.e. it is applied after the hash key comparison indicates that 
two files are potentially similar.

Properties of the similarity function and the algorithm:
- very similar files map to very similar, or even identical, hash keys;
- the distance between keys should be some measure of the difference between 
files. Since the keys are computed from file content, they grow roughly with 
file size, which can create false positives; the auxiliary data provides an 
easy and efficient way of refining the similarity detection;
- the metric used is a binary metric (simhash operates at the byte level);
- given a similarity metric, there needs to be a threshold to determine how 
close within the metric two files need to be to count as similar. 

I would like my implementation of the simhash algorithm to answer two questions:
- retrieving the set of files similar to a given file (a small sketch of this 
query follows the list);
- retrieving all pairs of similar files.
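
To illustrate how the pre-sorted keys would answer the first question, here is 
a minimal Java sketch; the integer key type and the tolerance parameter are my 
own assumptions, not something fixed by the paper. The second question could 
then be answered with a single sweep over the same sorted key list.

import java.util.Arrays;

public class KeyRangeQuery {

  /** Index of the first element >= value in an ascending array (lower bound). */
  static int lowerBound(int[] sorted, int value) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < value) { lo = mid + 1; } else { hi = mid; }
    }
    return lo;
  }

  /** Candidates similar to a given file: keys within +/- tolerance of its key. */
  static int[] candidateRange(int[] sortedKeys, int queryKey, int tolerance) {
    int from = lowerBound(sortedKeys, queryKey - tolerance);
    int to = lowerBound(sortedKeys, queryKey + tolerance + 1);
    return new int[] { from, to };            // candidate indices are [from, to)
  }

  public static void main(String[] args) {
    int[] keys = { 3, 10, 11, 11, 42, 57 };   // already sorted simhash keys
    int[] range = candidateRange(keys, 11, 2);
    System.out.println(Arrays.toString(Arrays.copyOfRange(keys, range[0], range[1])));
    // prints [10, 11, 11] for this toy input
  }
}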


2. simhash implementation
------------------------------
Choose 16 8-bit tags - the byte patterns used for matching. The idea is to 
count the occurrences of these tags within the file being processed. 


2.1 Computing the hash function for each file
-------------------------------------------------
for each directory
  for each file within the directory
     - scan through the file and look for matches of each tag;
     - the detection window is moved one bit at a time;
     - if a match is found
        - set a skip_counter and decrement it for the following bits;
     - a count of matches is kept for each tag; these counts are stored as 
SUM_TABLE_ENTRIES
     - the KEY for the current file is computed as a function of the 
SUM_TABLE_ENTRIES (e.g. a linear combination of the entries) [#1]
     - after this processing, we store: 
       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
    
The key function could also be implemented to take into account file 
extensions. 
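
To fix ideas, here is a rough, non-distributed Java sketch of the per-file 
computation described in 2.1. The class and method names, the concrete tag 
values, the skip length and the particular key function are all placeholder 
choices of mine, not something prescribed by the paper.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SimhashSketch {

  // Hypothetical tag table: 16 8-bit patterns (placeholder values, to be tuned).
  static final int[] TAGS = {
      0x01, 0x17, 0x29, 0x3B, 0x4D, 0x5F, 0x63, 0x75,
      0x87, 0x99, 0xAB, 0xBD, 0xC2, 0xD4, 0xE6, 0xF8
  };

  // Number of bits to skip after a match (an assumption, to be tuned).
  static final int SKIP_AFTER_MATCH = 8;

  /** SUM_TABLE_ENTRIES: for each tag, how many bit-aligned windows match it. */
  static int[] sumTableEntries(byte[] data) {
    int[] counts = new int[TAGS.length];
    int skip = 0;
    long lastStart = (long) data.length * 8 - 8;   // last valid starting bit
    for (long pos = 0; pos <= lastStart; pos++) {
      if (skip > 0) { skip--; continue; }          // skip_counter from 2.1
      int window = windowAt(data, pos);
      for (int t = 0; t < TAGS.length; t++) {
        if (window == TAGS[t]) {
          counts[t]++;
          skip = SKIP_AFTER_MATCH;
          break;
        }
      }
    }
    return counts;
  }

  /** Reads the 8 bits starting at bit offset pos as a value in [0, 255]. */
  static int windowAt(byte[] data, long pos) {
    int window = 0;
    for (int i = 0; i < 8; i++) {
      long bit = pos + i;
      int b = data[(int) (bit >>> 3)] & 0xFF;
      window = (window << 1) | ((b >>> (7 - (int) (bit & 7))) & 1);
    }
    return window;
  }

  /** KEY: one possible linear combination of the sum table entries [#1]. */
  static int key(int[] sumTable) {
    int key = 0;
    for (int t = 0; t < sumTable.length; t++) {
      key += (t + 1) * sumTable[t];
    }
    return key;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = Files.readAllBytes(Paths.get(args[0]));
    System.out.println("key = " + key(sumTableEntries(data)));
  }
}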


2.2 Finding similarities
----------------------------
for each FILE 
   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE); 
   for each SIMILAR_FILE
        if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
              discard the candidate
        else 
              compute the distance between the SUM_TABLE_ENTRIES of 
SIMILAR_FILE and FILE
              (i.e. the sum of the absolute values of the differences between 
their entries)
              if the distance is 0, the files are considered identical; if it 
is within a tolerance, they are considered similar
     
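As a sketch only, this second filter could be two helper methods living in the 
same hypothetical SimhashSketch class from 2.1; the size-ratio and distance 
tolerances are assumed parameters that would have to be tuned experimentally.

  /** Sum of the absolute differences between two files' SUM_TABLE_ENTRIES. */
  static int distance(int[] a, int[] b) {
    int d = 0;
    for (int i = 0; i < a.length; i++) {
      d += Math.abs(a[i] - b[i]);
    }
    return d;
  }

  /** Second filter: reject very different sizes, then compare the sum tables. */
  static boolean similar(long sizeA, int[] sumA, long sizeB, int[] sumB,
                         double sizeRatioTolerance, int distanceTolerance) {
    double ratio = (double) Math.max(sizeA, sizeB) / Math.max(1L, Math.min(sizeA, sizeB));
    if (ratio > sizeRatioTolerance) {
      return false;                           // file sizes differ too much
    }
    int d = distance(sumA, sumB);
    return d <= distanceTolerance;            // d == 0 means identical sum tables
  }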


3. Distributing simhash on hadoop
-------------------------------------
The algorithm above is well suited to a MapReduce implementation. 

1. Map
In this phase we compute the simhash key for each file. 
The map outputs (simhash_key, file) pairs, so that files are grouped by key.

2. Reduce
Once every file has a simhash key, group all files with the same key into a 
cluster; the clusters arrive sorted by the simhash key (the key is an integer 
[#1]).
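
A skeleton of how the two phases might look with Hadoop's Mapper/Reducer API. 
This is a sketch under my own assumptions: the input format (file name as key, 
file contents as a BytesWritable value) would have to come from a 
preprocessing step or a whole-file InputFormat, and SimhashSketch refers to 
the hypothetical class sketched in 2.1.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SimhashJob {

  /** Map: (file name, file contents) -> (simhash key, file name). */
  public static class SimhashMapper
      extends Mapper<Text, BytesWritable, IntWritable, Text> {
    @Override
    protected void map(Text fileName, BytesWritable contents, Context context)
        throws IOException, InterruptedException {
      byte[] data = Arrays.copyOf(contents.getBytes(), contents.getLength());
      int[] sumTable = SimhashSketch.sumTableEntries(data);  // see section 2.1
      context.write(new IntWritable(SimhashSketch.key(sumTable)), fileName);
    }
  }

  /** Reduce: files sharing a simhash key arrive together; emit the cluster. */
  public static class SimhashReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable simhashKey, Iterable<Text> fileNames,
        Context context) throws IOException, InterruptedException {
      StringBuilder cluster = new StringBuilder();
      for (Text name : fileNames) {
        if (cluster.length() > 0) {
          cluster.append(',');
        }
        cluster.append(name.toString());
      }
      context.write(simhashKey, new Text(cluster.toString()));
    }
  }
}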

QUESTIONS SO FAR:
1. For the distributed version, I believe there should be a preprocessing 
phase where the per-file data is stored on HDFS in a proper vector format (a 
possible record layout is sketched after these questions). Would it be better 
to store this in HBase, Cassandra or something similar?
2. Would it be better in the reduce phase to actually eliminate the duplicates 
and create a single file sorted by hash?
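
To make question 1 a bit more concrete, one option (again just a sketch; the 
class and field names are mine) would be to store one record per file in a 
SequenceFile on HDFS, with a Writable along these lines:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/** One possible per-file record stored on HDFS (field names are placeholders). */
public class FileFingerprintWritable implements Writable {
  private String fileName;
  private String filePath;
  private long fileSize;
  private int simhashKey;
  private int[] sumTableEntries = new int[16];   // one entry per tag

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(fileName);
    out.writeUTF(filePath);
    out.writeLong(fileSize);
    out.writeInt(simhashKey);
    for (int entry : sumTableEntries) {
      out.writeInt(entry);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    fileName = in.readUTF();
    filePath = in.readUTF();
    fileSize = in.readLong();
    simhashKey = in.readInt();
    for (int i = 0; i < sumTableEntries.length; i++) {
      sumTableEntries[i] = in.readInt();
    }
  }
}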


4. Implementation schedule
------------------------------

*** May 24 - Jul 12:
* Week 1: write documentation for the algorithm on the wiki and start planning 
how the objects will interact (classes, methods, etc.).
* Weeks 2-4: write the algorithm for hashing the files. 
* Week 5: test the hashing code thoroughly.
* Week 6: document the MapReduce implementation (2 days); organize the data 
structures (3 days).
* Weeks 7-9: write the distributed MapReduce implementation of the algorithm, 
along with tests. Test as much as possible on different datasets.

*** Jul 16 - Aug 9:
* Weeks 10-13: continue testing the distributed implementation on different 
datasets. If time permits, I would also like to produce graphs and publish the 
results of these tests.

I'm very open to suggestions regarding this plan, so any observations are 
highly appreciated.

Thank you again for taking the time to read this. 

Regards,
Cristi. 


[1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection, 
December 13, 2007.
On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:

> Minhash clustering is important for duplicate detection.  You can also
> do simhash
> clustering<http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>which
> might be simpler to implement.  I can imagine an implementation where
> the map generates the simhash and emits multiple copies keyed on part of the
> hash.  Near duplicates would thus be grouped together in the reduce where an
> in-memory duplicate detection could be done.
> 
> Sean's intuitions about project size are generally better than mine, but if
> you limit you are strict about not increasing scope, I think you could
> succeed with this project.
> 
> On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan 
> <prodan.crist...@gmail.com>wrote:
> 
>> IDEA 1 - MinHash clustering
>> ---------------------------
>> The first idea come after taking a look at Google 's paper on collaborative
>> filtering for their news system[2]. In that paper, I looked at MinHash
>> clustering.
>> My first question is: is MinHash clustering considered cool ? If yes, than
>> I would like to take a stab at implementing it.
>> The paper also describes the implementation in a MapReduce style. Since
>> this is only a suggestion I will not elaborate very much on the solution
>> now. I would like to ask you weather this might be considered a good choice
>> (i.e. important for the framework to have something like this) and if this
>> is a big enough project.
>> 
