At CXF, permission to modify the Confluence wiki is granted only to people with a CLA on file. Obviously, I have one, but do you need to grant me karma here before I can edit?

On Wed, Jun 17, 2009 at 9:22 AM, Grant Ingersoll <[email protected]> wrote:

> On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:
>
>> All I know is what I learned from reading the paper. However, I continue
>> to think, from reading the paper, that you may be trying to make Canopy
>> do something it was not intended to do.
>>
>> As I read the paper, the idea here is to get a rough partitioning that is
>> used to optimize various downstream algorithms, not to tune for a precise
>> partitioning. The number of canopies doesn't need, as I read it, to be
>> particularly close to the number of eventual partitions to be useful.
>>
>> Thus the extended discussion of how to start up and run various other
>> algorithms, (e.g. k-means).
>
> Makes sense.
>
>> Now, still, you need to get some useful number of partitions. The paper
>> has a classic toss-off line, 'we used cross-validation,' without any
>> details about exactly what the authors did. Presumably, that means that
>> the author ran many possible values and hand-examined the results. The
>> paper reports no general results about how sensitive the T values are to
>> particular input data sets. A pessimist would fear that, for any new
>> input, you're going to need to go through a lengthy process to find good
>> values for T1 and T2.
>>
>> This leads me to wonder, ignorantly, why this project is so focused on
>> Canopy. The paper describes it as a tool for speeding up various other
>> things. Since you're hadooping all those other things, how much does it
>> help?
>
> I don't think anyone is solely focused on it, but it is something that we
> have available in our arsenal of clustering tools, therefore it warrants
> documentation and understanding of when and how to use. Personally, it's
> just something I could easily run to work on MAHOUT-121.
>
> At any rate, this kind of write up is exactly the advice that we need to be
> able to give people. Care to add to
> http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?
>
>> Anyway, I expect that my ignorance is on comprehensive display here.
>
> Funny, I feel like my ignorance is the one on display, but that is
> something I got over a long time ago in open source. Which is why I just
> come out and ask the questions! One of my goals for Mahout is to make it a
> place where people can come and learn about Machine Learning and get
> practical advice and not be afraid to ask basic questions. Machine learning
> is so shrouded in mystery it almost seems like a Dark Art. I'm thankful
> every day on this project that smarter people than me show up and answer
> questions. So, please, keep 'em coming!
>
> -Grant
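
For anyone following the T1/T2 discussion quoted above, here is a rough sketch of the canopy step as it is commonly described. It is not Mahout's implementation. The function name `canopy_cluster`, the Euclidean distance, and the choice to always take the first remaining point as the next center are assumptions made for illustration (the paper picks points in arbitrary order and uses a cheap approximate distance).

```python
import math


def canopy_cluster(points, t1, t2, distance=math.dist):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold and t2 the tight one, with 0 < t2 < t1.
    Returns a list of (center, members) pairs.
    """
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")

    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the next center.
        center = remaining[0]
        # Every remaining point within the loose threshold joins the canopy.
        members = [p for p in remaining if distance(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold are dropped from the candidate
        # list (the center itself is at distance 0, so the loop terminates).
        remaining = [p for p in remaining if distance(center, p) >= t2]
    return canopies


if __name__ == "__main__":
    # Hypothetical toy data, only to show how the canopy count moves with T1/T2.
    data = [(0, 0), (0.5, 0), (1, 0), (5, 5), (5.5, 5), (10, 10)]
    for t1, t2 in [(0.6, 0.3), (3, 1.5), (20, 10)]:
        print(t1, t2, len(canopy_cluster(data, t1, t2)))
```

Running the sweep at the bottom on your own data is a crude version of the "try many values and inspect the results" process the paper seems to imply. The canopy count can swing widely with the thresholds, which is the sensitivity concern raised in the quoted message.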
