[jira] Commented: (MAHOUT-344) Minhash based clustering

Ankur (JIRA) Wed, 29 Sep 2010 09:17:00 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916186#action_12916186
 ]


Ankur commented on MAHOUT-344:
------------------------------

Updated (and hopefully penultimate) patch with:

1. All the code issues fixed after another round of cleanup and testing.
2. Example code for converting the LastFM dataset into Mahout vector format, 
written to SequenceFiles.
3. Cluster quality evaluation code for analyzing the precision at various 
similarity thresholds.

Here are the steps to run the clustering code on the LastFM dataset after 
applying the patch and building the Mahout jars:

1. Download  
http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz and 
uncompress and untar into a local dir.

2. Run the following command to convert the data into vector format and write 
it to a sequence file:
       * java -Xms512m -Xmx512m -cp  
/path/to/commons-logging-1.0.4.jar:/path/to/log4j-1.2.15.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-examples-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar
 org.apache.mahout.clustering.minhash.LastfmDataConverter 
/path/to/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv 
/path/to/lastfm-vector-formated.seq

3. Upload the file 'lastfm-vector-formated.seq' to the DFS dir 'lastfm'.

4. Before running the clustering MR job, run the following at the shell prompt:
        * export HADOOP_CLASSPATH=/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar:/path/to/commons-cli-2.0-mahout.jar

5. Run the clustering MR job on the data. Here is a sample command line:
        * hadoop jar /path/to/mahout-core-0.4-SNAPSHOT.jar  
org.apache.mahout.clustering.minhash.MinHashDriver -Dio.sort.mb=256 
-Dio.sort.factor=20 -libjars /path/to/mahout-math-0.4-SNAPSHOT.jar --input 
lastfm  --output lastfm-out --minClusterSize 10 --minVectorSize 10 --hashType 
polynomial --numHashFunctions 60 --keyGroups 2 --debugOutput true --numReducers 
1 
        * Note: debugOutput is set to true so that the entire vectors are 
written out with the clusters; they are used later for the similarity 
computation in quality evaluation.

6. Download the file under 'lastfm-out' to a local dir and run the evaluation 
code as follows:
        * java -cp 
/path/to/commons-logging-1.0.4.jar:/path/to/log4j-1.2.15.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-examples-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar
 org.apache.mahout.clustering.minhash.LastfmClusterEvaluator 
lastfm-cluster-data.seq 0.2 0.5
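
To illustrate what the conversion in step 2 starts from, here is a minimal 
Python sketch that groups the rows of the TSV file by listener. It assumes the 
four tab-separated columns suggested by the file name (user SHA1, artist MBID, 
artist name, play count). The function name and the sample rows are 
hypothetical, and this is not the LastfmDataConverter code from the patch, 
which writes Mahout vectors to a SequenceFile instead of Python sets.

```python
import io
from collections import defaultdict


def listeners_to_artist_sets(tsv_lines):
    """Group rows by listener and collect the artists each listener played.

    Assumed column layout: user SHA1, artist MBID, artist name, play count.
    """
    artists_by_user = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        user, _mbid, artist_name, _plays = fields[:4]
        artists_by_user[user].add(artist_name)
    return dict(artists_by_user)


# Made-up sample rows standing in for the real dataset file.
sample = io.StringIO(
    "u1\tmbid-a\tradiohead\t120\n"
    "u1\tmbid-b\tportishead\t30\n"
    "u2\tmbid-a\tradiohead\t7\n"
)
artist_sets = listeners_to_artist_sets(sample)
```

Each listener ends up as a set of artists, which is the kind of sparse 
set-valued record that minhash operates on.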
        
Here are some of the results I got with different threshold and sampling 
parameters:

Test Results
=============
 (A) Listeners in same cluster with similarity above threshold (0.2) : 4997
 (B) All listeners: 15564
 Average cluster precision: A/B = 32.11%

 (A) Listeners in same cluster with similarity above threshold (0.3) : 1872
 (B) All listeners: 15564
 Average cluster precision: A/B = 12.03%
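
The A/B figures above can be reproduced from the counts, and the counting 
itself can be sketched in Python. The definition used below is an assumption: a 
listener counts toward (A) if at least one other listener in the same cluster 
has Jaccard similarity above the threshold. The evaluator in the patch may 
count differently, and the function names and toy clusters are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_precision(clusters, threshold):
    """Return (similar, total, percent) over all clusters.

    Assumed rule: a listener is 'similar' if some other member of its
    cluster has Jaccard similarity strictly above the threshold.
    """
    similar = 0
    total = 0
    for members in clusters:
        for i, item in enumerate(members):
            total += 1
            if any(jaccard(item, other) > threshold
                   for j, other in enumerate(members) if j != i):
                similar += 1
    percent = 100.0 * similar / total if total else 0.0
    return similar, total, percent


# Toy clusters: each listener is a set of artist ids.
clusters = [
    [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}],
    [{1, 2}, {1, 2}],
]
similar, total, percent = cluster_precision(clusters, 0.2)

# The percentages reported above, recomputed from the raw counts.
reported_02 = round(100.0 * 4997 / 15564, 2)
reported_03 = round(100.0 * 1872 / 15564, 2)
```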

The only tasks remaining are updating the Javadoc comments and incorporating 
any review comments. Apart from those two, this should be good to go in.


> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>             Fix For: 0.4
>
>         Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch, 
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch, MAHOUT-344-v5.patch, 
> MAHOUT-344-v6.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of 
> high-dimensional data. The essence of the technique is to hash each item 
> using multiple independent hash functions such that the probability of 
> collision of similar items is higher. Multiple such hash tables can then be 
> constructed to answer near-neighbor queries efficiently.
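
The idea in the description above can be sketched in a few lines of Python. 
This is a simplified illustration under stated assumptions, not the code in the 
patch: it uses linear hash functions of the form (a*x + b) mod p rather than the 
polynomial hash type selected on the command line, it groups signature values 
into non-overlapping blocks, and all names are hypothetical.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hash_functions(count, seed=42):
    """Create `count` random linear hash functions as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(count)]


def minhash_signature(items, hash_functions):
    """For each hash function keep the minimum hash over a non-empty set of ints."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def cluster_keys(signature, key_groups):
    """Join consecutive blocks of `key_groups` min-hash values into bucket keys."""
    return ["-".join(str(v) for v in signature[i:i + key_groups])
            for i in range(0, len(signature) - key_groups + 1, key_groups)]


hash_functions = make_hash_functions(60)
listener_a = {1, 2, 3, 4, 5}
listener_b = {1, 2, 3, 4, 6}
sig_a = minhash_signature(listener_a, hash_functions)
sig_b = minhash_signature(listener_b, hash_functions)

# The fraction of matching positions estimates the Jaccard similarity.
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Listeners that share any bucket key would land in the same candidate cluster.
shared_keys = set(cluster_keys(sig_a, 2)) & set(cluster_keys(sig_b, 2))
```

Larger key groups make a shared key less likely for dissimilar items, while 
more hash functions give similar items more chances to collide.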

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
