[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851686#action_12851686 ]
Drew Farris commented on MAHOUT-344:
------------------------------------

Hi Cristi,

Sounds like a great start. Answers for a couple of your questions:

{quote}
Is there a standard formatting for the input on each clustering alg or the input format follows the same rules for all algorithms, and then the users write conversion tools which ?
{quote}

Take a look at the various Vector classes in the math module and the VectorWritable wrapper. Most of the clustering algorithms take vectors of one kind or another as input, and the assumption is that users will write tools to convert their data to these common formats. The wiki page http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html is a good place to start.

{quote}
would it be ok if I attach the code which does an example of running min-hash clustering in the examples dirs ? (it would first convert the dataset format accordingly)
{quote}

Go for it, code is good, patches are even better, see: http://cwiki.apache.org/MAHOUT/howtocontribute.html#HowToContribute-Creatingthepatchfile and simply attach it to this issue.

> Minhash based clustering
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high
> dimensional data. The essence of the technique is to hash each item using
> multiple independent hash functions such that the probability of collision of
> similar items is higher. Multiple such hash tables can then be constructed
> to answer near neighbor type of queries efficiently.
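
For readers new to the technique, the hashing idea in the issue description can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in MAHOUT-344-v1.patch and not a Mahout API: the class name MinHashSketch, the parameters numHashes and seed, the representation of an item as a set of integer feature IDs, and the hash family h(x) = (a*x + b) mod p are all hypothetical choices made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Illustrative MinHash sketch (hypothetical class, not part of Mahout). */
final class MinHashSketch {
  // Modulus of the hash family: the prime 2^31 - 1 (an assumption of this sketch).
  private static final long PRIME = 2147483647L;

  private final long[] a; // multipliers, one per hash function, in 1..PRIME-1
  private final long[] b; // offsets, one per hash function, in 0..PRIME-1

  MinHashSketch(int numHashes, long seed) {
    Random rnd = new Random(seed);
    a = new long[numHashes];
    b = new long[numHashes];
    for (int i = 0; i < numHashes; i++) {
      a[i] = 1 + rnd.nextInt((int) PRIME - 1);
      b[i] = rnd.nextInt((int) PRIME);
    }
  }

  /** For each hash function, keep the minimum hash value over the item's features. */
  long[] signature(Set<Integer> features) {
    long[] sig = new long[a.length];
    Arrays.fill(sig, Long.MAX_VALUE);
    for (int x : features) {
      for (int i = 0; i < a.length; i++) {
        long h = Math.floorMod(a[i] * x + b[i], PRIME);
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }

  /** Fraction of positions where two signatures agree; estimates Jaccard similarity. */
  static double estimateSimilarity(long[] s1, long[] s2) {
    int same = 0;
    for (int i = 0; i < s1.length; i++) {
      if (s1[i] == s2[i]) {
        same++;
      }
    }
    return (double) same / s1.length;
  }

  public static void main(String[] args) {
    MinHashSketch mh = new MinHashSketch(128, 42L);
    Set<Integer> docA = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
    Set<Integer> docB = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 7));
    Set<Integer> docC = new HashSet<>(Arrays.asList(100, 200, 300));
    System.out.println("A vs B: " + estimateSimilarity(mh.signature(docA), mh.signature(docB)));
    System.out.println("A vs C: " + estimateSimilarity(mh.signature(docA), mh.signature(docC)));
  }
}
```

Identical feature sets always yield an estimate of 1.0. With this particular hash family, disjoint sets of non-negative feature IDs below the modulus yield 0.0, because distinct IDs never hash to the same value. For partially overlapping sets the estimate approximates the Jaccard similarity and tightens as numHashes grows. A clustering step could then group items whose signatures agree on some subset of positions, which is one way to read the "multiple such hash tables" in the description.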