Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Benson Margulies Wed, 17 Jun 2009 06:30:20 -0700

At CXF, permission to modify the Confluence wiki is granted only to people
with a CLA on file. Obviously, I have one, but do you need to grant me karma
here before I can edit?



On Wed, Jun 17, 2009 at 9:22 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:
>
>> All I know is what I learned from reading the paper. However, I continue to
>> think, from reading the paper, that you may be trying to make Canopy do
>> something it was not intended to do.
>>
>> As I read the paper, the idea here is to get a rough partitioning that is
>> used to optimize various downstream algorithms, not to tune for a precise
>> partitioning. The number of canopies doesn't need, as I read it, to be
>> particularly close to the number of eventual partitions to be useful.
>>
>> Thus the paper's extended discussion of how to start up and run various
>> other algorithms (e.g. k-means).
>>
>
> Makes sense.
>
>
>> Now, still, you need to get some useful number of partitions. The paper has
>> a classic toss-off line, 'we used cross-validation,' without any details
>> about exactly what the authors did. Presumably, that means that the authors
>> ran many possible values and hand-examined the results. The paper reports no
>> general results about how sensitive the T values are to particular input
>> data sets. A pessimist would fear that, for any new input, you're going to
>> need to go through a lengthy process to find good values for T1 and T2.
>>
>> This leads me to wonder, ignorantly, why this project is so focused on
>> Canopy. The paper describes it as a tool for speeding up various other
>> things. Since you're hadooping all those other things, how much does it
>> help?
>>
>
> I don't think anyone is solely focused on it, but it is something that we
> have available in our arsenal of clustering tools, so it warrants
> documentation and an understanding of when and how to use it.  Personally,
> it's just something I could easily run to work on MAHOUT-121.
>
> At any rate, this kind of write up is exactly the advice that we need to be
> able to give people.  Care to add to
> http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?
>
>
>
>> Anyway, I expect that my ignorance is on comprehensive display here.
>>
>>
> Funny, I feel like my ignorance is the one on display, but that is
> something I got over a long time ago in open source.  Which is why I just
> come out and ask the questions!  One of my goals for Mahout is to make it a
> place where people can come and learn about Machine Learning and get
> practical advice and not be afraid to ask basic questions.  Machine learning
> is so shrouded in mystery it almost seems like a Dark Art.  I'm thankful
> every day on this project that smarter people than me show up and answer
> questions.  So, please, keep 'em coming!
>
> -Grant
>
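
For anyone following the T1/T2 discussion quoted above, here is a minimal
sketch of the canopy procedure as I understand it from the paper, plus a small
sweep that reports how many canopies each (T1, T2) pair produces. It is not
Mahout's implementation, and it is not the cross-validation the paper mentions
(the paper gives no details on that). The names canopy, sweep and euclidean are
made up for illustration, and the choice of Euclidean distance and of taking
the first remaining point as the next center are assumptions.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold and t2 the tight one, with t1 > t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("expected t1 > t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the next center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: p joins this canopy.
                members.append(p)
            if d >= t2:
                # Not within the tight threshold: p may still seed
                # or join other canopies.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def sweep(points, t_pairs):
    """Map each (t1, t2) pair to the number of canopies it yields."""
    return {(t1, t2): len(canopy(points, t1, t2)) for t1, t2 in t_pairs}


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
           (10.0, 0.0), (10.5, 0.0), (20.0, 0.0)]
    for pair, count in sweep(pts, [(0.8, 0.4), (3.0, 1.5), (30.0, 15.0)]).items():
        print(pair, count)
```

Even on this toy input the canopy count moves a lot with the thresholds, which
is the sensitivity the quoted message worries about. A sweep like this only
shows the count, so judging whether a given partitioning is useful for a
downstream algorithm such as k-means still takes a separate evaluation.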
