[
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916186#action_12916186
]
Ankur commented on MAHOUT-344:
------------------------------
Updated (and hopefully penultimate) patch with:
1. All the code issues fixed after another round of cleanup and testing.
2. Example code for conversion of LastFM dataset into Mahout vector format
written to sequenceFiles.
3. Cluster quality evaluation code for analyzing the precision at various
similarity thresholds.
Here are the steps to run the clustering code on the LastFM dataset after applying
the patch and building the Mahout jars:-
1. Download
http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz, then
uncompress and untar it into a local dir.
2. Run the following command to convert the data into vector format and write
it to sequence files:-
* java -Xms512m -Xmx512m -cp
/path/to/commons-logging-1.0.4.jar:/path/to/log4j-1.2.15.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-examples-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar
org.apache.mahout.clustering.minhash.LastfmDataConverter
/path/to/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv
/path/to/lastfm-vector-formated.seq
3. Upload file 'lastfm-vector-formated.seq' to DFS dir 'lastfm'
4. Before running the clustering MR job do the following on the shell prompt:-
* export
HADOOP_CLASSPATH=/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar:/path/to/commons-cli-2.0-mahout.jar
5. Run the clustering MR job on the data. Here is a sample command line:-
* hadoop jar /path/to/mahout-core-0.4-SNAPSHOT.jar
org.apache.mahout.clustering.minhash.MinHashDriver -Dio.sort.mb=256
-Dio.sort.factor=20 -libjars /path/to/mahout-math-0.4-SNAPSHOT.jar --input
lastfm --output lastfm-out --minClusterSize 10 --minVectorSize 10 --hashType
polynomial --numHashFunctions 60 --keyGroups 2 --debugOutput true --numReducers
1
* Note:- debugOutput is set to true so that the entire vectors are kept in
the clustered output; they are used later for the similarity computation in
quality evaluation.
6. Download the file under 'lastfm-out' to a local dir and run the evaluation
code as follows:-
* java -cp
/path/to/commons-logging-1.0.4.jar:/path/to/log4j-1.2.15.jar:/path/to/hadoop-0.20.2-core.jar:/path/to/mahout-math-0.4-SNAPSHOT.jar:/path/to/mahout-examples-0.4-SNAPSHOT.jar:/path/to/mahout-core-0.4-SNAPSHOT.jar
org.apache.mahout.clustering.minhash.LastfmClusterEvaluator
lastfm-cluster-data.seq 0.2 0.5
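The quality evaluation in step 6 can be pictured with the sketch below. It is not the LastfmClusterEvaluator code from the patch. The class name ClusterPrecisionSketch, the use of Jaccard similarity over artist sets, and the rule "a listener counts if at least one other listener in the same cluster is above the threshold" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the evaluator shipped in the patch.
public class ClusterPrecisionSketch {

    // Jaccard similarity: |x ∩ y| / |x ∪ y|. Two empty sets are treated as dissimilar.
    static double jaccard(Set<String> x, Set<String> y) {
        if (x.isEmpty() && y.isEmpty()) {
            return 0.0;
        }
        Set<String> small = x.size() <= y.size() ? x : y;
        Set<String> large = small == x ? y : x;
        int common = 0;
        for (String item : small) {
            if (large.contains(item)) {
                common++;
            }
        }
        return (double) common / (x.size() + y.size() - common);
    }

    // Assumed reading of (A): listeners that have at least one cluster-mate
    // whose similarity is strictly above the threshold.
    static int countSimilarListeners(List<List<Set<String>>> clusters, double threshold) {
        int count = 0;
        for (List<Set<String>> cluster : clusters) {
            for (int i = 0; i < cluster.size(); i++) {
                for (int j = 0; j < cluster.size(); j++) {
                    if (i != j && jaccard(cluster.get(i), cluster.get(j)) > threshold) {
                        count++;
                        break;
                    }
                }
            }
        }
        return count;
    }

    // (B): all listeners across all clusters.
    static int countAllListeners(List<List<Set<String>>> clusters) {
        int total = 0;
        for (List<Set<String>> cluster : clusters) {
            total += cluster.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up toy data: each set holds the artists one listener has played.
        List<List<Set<String>>> clusters = new ArrayList<>();
        clusters.add(Arrays.asList(
            new HashSet<>(Arrays.asList("a", "b", "c")),
            new HashSet<>(Arrays.asList("a", "b", "d")),
            new HashSet<>(Arrays.asList("x", "y"))));
        int similar = countSimilarListeners(clusters, 0.2);
        int all = countAllListeners(clusters);
        System.out.println("A = " + similar + ", B = " + all);
    }
}
```

The pairwise loop is quadratic in the cluster size, so a real evaluator would probably sample listeners rather than compare every pair.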
Here are some of the results I got with different threshold and sampling
parameters:
Test Results
=============
(A) Listeners in same cluster with similarity above threshold (0.2) : 4997
(B) All listeners: 15564
Average cluster precision: A/B = 32.11
(A) Listeners in same cluster with similarity above threshold (0.3) : 1872
(B) All listeners: 15564
Average cluster precision: A/B = 12.03
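The A/B figures above are consistent with the ratio being reported as a percentage (4997 / 15564 is about 0.3211). A minimal check of that arithmetic, with the class and method names chosen only for this example:

```java
import java.util.Locale;

// Hypothetical helper that reproduces the A/B arithmetic from the test results.
public class ClusterPrecision {

    // Returns A/B as a percentage.
    static double precisionPercent(long similarListeners, long allListeners) {
        if (allListeners <= 0) {
            throw new IllegalArgumentException("allListeners must be positive");
        }
        return 100.0 * similarListeners / allListeners;
    }

    public static void main(String[] args) {
        // Threshold 0.2: prints 32.11
        System.out.println(String.format(Locale.ROOT, "%.2f", precisionPercent(4997, 15564)));
        // Threshold 0.3: prints 12.03
        System.out.println(String.format(Locale.ROOT, "%.2f", precisionPercent(1872, 15564)));
    }
}
```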
The only remaining tasks are updating the Javadoc comments and incorporating any
review comments. Apart from those two, this should be good to go in.
> Minhash based clustering
> -------------------------
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.3
> Reporter: Ankur
> Assignee: Ankur
> Fix For: 0.4
>
> Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch,
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch, MAHOUT-344-v5.patch,
> MAHOUT-344-v6.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high
> dimensional data. The essence of the technique is to hash each item using
> multiple independent hash functions such that the probability of collision of
> similar items is higher. Multiple such hash tables can then be constructed
> to answer near neighbor type of queries efficiently.
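The hashing idea in the issue description can be sketched as follows. This is a minimal illustration, not the code in the patch: the class name MinHashSketch, the linear hash family h(x) = (a*x + b) mod p, the prime and the seed are all assumptions made for the example. Each hash function keeps the minimum hash value over a set's items, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the two sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration of minhash signatures; not Mahout's implementation.
public class MinHashSketch {
    // Prime modulus 2^31 - 1 for the assumed linear hash family.
    private static final long PRIME = 2147483647L;

    private final long[] a;
    private final long[] b;

    public MinHashSketch(int numHashFunctions, long seed) {
        if (numHashFunctions <= 0) {
            throw new IllegalArgumentException("numHashFunctions must be positive");
        }
        Random random = new Random(seed);
        a = new long[numHashFunctions];
        b = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            a[i] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = random.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    // For each hash function, keep the minimum hash value over all items in the set.
    public long[] signature(Set<Integer> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = Math.floorMod((long) item, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // fits in a long: both factors < 2^31
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of signature positions that collide; estimates Jaccard similarity.
    public static double estimateSimilarity(long[] s1, long[] s2) {
        if (s1.length != s2.length) {
            throw new IllegalArgumentException("signatures must have equal length");
        }
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                matches++;
            }
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch sketch = new MinHashSketch(60, 42L);
        Set<Integer> x = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        Set<Integer> y = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 9, 10));
        // True Jaccard similarity of x and y is 6/10; the estimate varies with the seed.
        System.out.println(estimateSimilarity(sketch.signature(x), sketch.signature(y)));
    }
}
```

Grouping several signature positions into one key, as the --keyGroups option in the command line above suggests, would then place items with matching keys in the same cluster; how the patch does this exactly is not shown here.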
--
This message is automatically generated by JIRA.