Re: [GSOC] Ranking Process

2009-04-01 Thread Richard Tomsett
I'm preparing an application but haven't submitted it yet, as I was
waiting on confirmation of my student status. Now that I know I'm
going to be eligible, I'll get my application in soon :)

2009/4/1 Ted Dunning ted.dunn...@gmail.com:
 I only see two applications for Mahout, one reasonably strong, one much less
 so.

 Are there students out there who still need to prepare an application?

 The deadline is coming up fast.

 2009/3/31 Grant Ingersoll gsing...@apache.org

 FYI: http://wiki.apache.org/general/RankingProcess

 -Grant




 --
 Ted Dunning, CTO
 DeepDyve



[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

2009-03-18 Thread Richard Tomsett (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683254#action_12683254
 ] 

Richard Tomsett commented on MAHOUT-59:
---

Ugh, I had an example almost done but managed to overwrite it by having 
folders with too-similar names. That'll teach me :-\ Anyway, I'm looking at the 
K-Means issue [MAHOUT-99] at the moment, but will hopefully post a bag-of-words 
example relatively soon...!

 Create some examples of clustering well-known datasets
 --

 Key: MAHOUT-59
 URL: https://issues.apache.org/jira/browse/MAHOUT-59
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Reporter: Jeff Eastman
 Attachments: MAHOUT-59.patch


 The existing unit tests for clustering need to be augmented with examples 
 from the literature which illustrate its correct operation on datasets which 
 have known clusters present. See http://archive.ics.uci.edu/ml/ for some 
 candidate datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-109) Implementation of Cosine distance measure, plus unit test.

2009-03-08 Thread Richard Tomsett (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Tomsett updated MAHOUT-109:
---

Status: Patch Available  (was: Open)

 Implementation of Cosine distance measure, plus unit test.
 --

 Key: MAHOUT-109
 URL: https://issues.apache.org/jira/browse/MAHOUT-109
 Project: Mahout
  Issue Type: Improvement
Reporter: Richard Tomsett
Priority: Trivial

 This is a class implementing a cosine distance measure. In various places 
 I've seen cosine *similarity* defined as being the cosine of the angle 
 between vectors - cos(a,b) - and cosine *distance* being (1 - cos(a,b)), so 
 in keeping with the other distance measures, this returns 1-cos(angle) as the 
 distance.
 I made a new test class rather than using the default distance measure check, as 
 the vectors in the current default test class are all scalar multiples of one 
 another and so have a cosine distance of zero between them ([1,1,1,1,1,1], 
 [3,3,3,3,3,3] and [6,6,6,6,6,6]). The test checks the cosine distances between 
 [1,0,0,0,0,0], [1,1,1,0,0,0] and [1,1,1,1,1,1].
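
 For illustration only, here is a minimal Python sketch of the 1 - cos(angle)
 definition described above. It is not the attached Mahout class (which is
 Java); the function name cosine_distance and the choice to raise on a zero
 vector are assumptions made for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: the angle is undefined for a zero vector.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors from the default test: all parallel, so every pairwise distance is ~0.
parallel = [[1] * 6, [3] * 6, [6] * 6]

# Vectors from the new test: these point in different directions.
test_vectors = [[1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]

if __name__ == "__main__":
    for i, u in enumerate(test_vectors):
        for v in test_vectors[i + 1:]:
            print(u, v, round(cosine_distance(u, v), 4))
```

 With the second set of vectors the pairwise distances are 1 - 1/sqrt(3),
 1 - 1/sqrt(6) and 1 - 1/sqrt(2), so a test can tell a correct implementation
 from one that always returns zero.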




[jira] Issue Comment Edited: (MAHOUT-59) Create some examples of clustering well-known datasets

2009-02-19 Thread Richard Tomsett (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674983#action_12674983
 ] 

richardtomsett edited comment on MAHOUT-59 at 2/19/09 4:55 AM:


Re: discussion of text clustering on the mailing list, there are several 'bag 
of words' examples at the UCI repository: 
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words . The data is in [docID 
wordID wordcount] format so needs to be processed into TF-IDF Vectors for 
clustering. I previously did this with a Python script but I'll write something 
in Hadoop to do it, before passing it on to Canopy or K-Means clustering. May 
take a little while as I haven't looked at my code for about half a year, and I 
didn't write unit tests or anything last time...

This would also involve writing a cosine distance measure class, which I guess 
would be useful generally. Could also involve ideas from 
https://issues.apache.org/jira/browse/MAHOUT-65 re: labelling data points 
(documents). Would this be a useful example?
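
As a rough illustration of the preprocessing step described above, here is a 
small Python sketch that turns [docID wordID wordcount] triples into sparse 
TF-IDF vectors. It is not the original script or a Hadoop job. The function 
names, the weighting (raw count times log(N/df)) and the rule of skipping any 
line that does not have exactly three fields are all assumptions made for this 
example.

```python
import math
from collections import defaultdict


def parse_triples(lines):
    """Yield (doc_id, word_id, count) from whitespace-separated text lines.

    Assumption: lines without exactly three fields are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield tuple(int(p) for p in parts)


def tfidf_vectors(triples):
    """Return {doc_id: {word_id: weight}} using weight = count * log(N / df)."""
    counts = defaultdict(dict)   # doc_id -> {word_id: count}
    doc_freq = defaultdict(int)  # word_id -> number of documents containing it
    for doc_id, word_id, count in triples:
        if word_id not in counts[doc_id]:
            doc_freq[word_id] += 1
        counts[doc_id][word_id] = counts[doc_id].get(word_id, 0) + count
    n_docs = len(counts)
    return {
        doc_id: {w: c * math.log(n_docs / doc_freq[w]) for w, c in words.items()}
        for doc_id, words in counts.items()
    }


if __name__ == "__main__":
    sample = ["1 1 2", "1 2 1", "2 1 1", "2 3 4"]
    for doc_id, vector in tfidf_vectors(parse_triples(sample)).items():
        print(doc_id, vector)
```

In the sample, word 1 appears in both documents and so gets a weight of zero, 
while words 2 and 3 each appear in only one document and keep a positive 
weight. The resulting sparse vectors are the kind of input a cosine distance 
measure would compare.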

  was (Author: richardtomsett):
Re: discussion of text clustering on the mailing list, there are several 
'bag of words' examples at the UCI repository: 
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words . The data is in [docID 
wordID wordcount] format so needs to be processed into TF-IDF Vectors for 
clustering. I previously did this with a Python script but I'll write something 
in Hadoop to do it, before passing it on to Canopy or K-Means clustering. May 
take a little while as I haven't looked at my code for about half a year, and I 
didn't write unit tests or anything last time...

This would also involve writing a cosine distance measure class, which I guess 
would be useful generally. Would this be a useful example?
  
 Create some examples of clustering well-known datasets
 --

 Key: MAHOUT-59
 URL: https://issues.apache.org/jira/browse/MAHOUT-59
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Reporter: Jeff Eastman
 Attachments: MAHOUT-59.patch


 The existing unit tests for clustering need to be augmented with examples 
 from the literature which illustrate its correct operation on datasets which 
 have known clusters present. See http://archive.ics.uci.edu/ml/ for some 
 candidate datasets.
