[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851686#action_12851686 ]

Drew Farris commented on MAHOUT-344:
------------------------------------

Hi Cristi,

Sounds like a great start. Answers for a couple of your questions:

{quote}
Is there a standard input format for each clustering alg, or does the input 
format follow the same rules for all algorithms, with users writing their own 
conversion tools?
{quote}

Take a look at the various Vector classes in the math module and the 
VectorWritable wrapper. Most of the clustering algorithms take vectors of one 
kind or another as input, and the assumption is that users will write tools to 
convert their data to these common formats. The wiki page 
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html is a good place 
to start.
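
To make the conversion step concrete, here is a rough sketch of turning raw 
text into sparse term-count vectors. It is plain Python, not Mahout code: the 
function names (build_dictionary, to_sparse_vector) and the sample documents 
are made up for illustration, and a dict of {term index: count} merely stands 
in for a sparse Vector. A real tool would build Mahout Vector objects and wrap 
them in VectorWritable instead.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct lower-cased token a stable integer index."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}


def to_sparse_vector(doc, dictionary):
    """Map a document to {term index: term count}, skipping unknown tokens."""
    counts = Counter(tok for tok in doc.lower().split() if tok in dictionary)
    return {dictionary[tok]: n for tok, n in counts.items()}


# Hypothetical sample input; any whitespace-separated text would do.
docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(d, dictionary) for d in docs]
```

Only the non-zero entries are stored, which is the usual reason to prefer a 
sparse representation for text.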

{quote}
would it be ok if I attach the code which does an example of running min-hash 
clustering in the examples dirs ? (it would first convert the dataset format 
accordingly)
{quote}

Go for it. Code is good, and patches are even better. See 
http://cwiki.apache.org/MAHOUT/howtocontribute.html#HowToContribute-Creatingthepatchfile
for how to create the patch file, then simply attach it to this issue.

> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that the probability of collision of 
> similar items is higher. Multiple such hash tables can then be constructed  
> to answer near neighbor type of queries efficiently.
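
For readers new to the technique, the description above can be sketched 
roughly as follows. This is a minimal illustration in plain Python, not the 
code in the attached patch: the names (make_hash_functions, 
minhash_signature, estimate_similarity), the linear hash family 
(a*x + b) mod p, and the choice of prime are all assumptions made for the 
example. Items are assumed to be non-empty sets of non-negative integer 
feature ids smaller than the prime.

```python
import random

# Assumed modulus for the illustrative hash family: the Mersenne prime 2^31 - 1.
PRIME = 2_147_483_647


def make_hash_functions(k, seed=42):
    """Draw k (a, b) pairs defining h(x) = (a * x + b) mod PRIME, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_functions):
    """Keep, for each hash function, the minimum hash value over the item set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


hash_functions = make_hash_functions(k=64)
sig_x = minhash_signature({1, 2, 3, 4, 5}, hash_functions)
sig_y = minhash_signature({1, 2, 3, 4, 6}, hash_functions)
similarity = estimate_similarity(sig_x, sig_y)
```

Two sets that share most of their elements tend to share the same minimum 
under each hash function, so their signatures collide in many positions. 
Grouping items by a few signature values at a time is one way to obtain the 
hash tables mentioned in the description.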

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
