On Wed, Feb 22, 2012 at 10:32 AM, Jeff Eastman <[email protected]> wrote:
> This refactoring is focused on some of the iterative clustering algorithms
> which, in each iteration, load a prior set of clusters (e.g. clusters-0)
> and process each input vector against them to produce a posterior set of
> clusters (e.g. clusters-1) for the next iteration. This will result in
> k-Means, fuzzyK and Dirichlet being collapsed into a ClusterIterator
> iterating over a ClusterClassifier using a ClusteringPolicy. You can see
> these classes in o.a.m.clustering. They are a work in progress, but
> in-memory, sequential-from-sequenceFiles and k-means MR all work in tests
> and can be demonstrated in the DisplayXX examples which employ them.
>
> Paritosh has also been building a ClusterClassificationDriver
> (o.a.m.clustering.classify) which we want to use to factor all of the
> redundant cluster-data implementations (-cl option) out of the respective
> cluster drivers. This will affect Canopy in addition to the above
> algorithms.
>
> An imagined benefit of this refactoring comes from the fact that
> ClusterClassifier extends AbstractVectorClassifier and implements
> OnlineLearner. We think this means that a posterior set of trained
> Clusters can be used as a component classifier in a semi-supervised
> classifier implementation. I suppose we will need to demonstrate this
> before we go too much further in the refactoring, but Ted, at least,
> seems to approve of this integration approach between supervised
> classification and clustering (unsupervised classification). I don't
> think it has had a lot of other eyeballs on it.
>
> I don't think LDA fits into this subset of clustering algorithms, nor do
> Canopy and MeanShift. As you note, LDA does not produce Clusters, but
> I'd be interested in your reactions to the above.

So LDA lives in o.a.m.clustering, and does actually produce what you could
*call* clusters - it assigns fuzzy weighted cluster_ids (called topic_ids
in LDA) to training data, in much the same way that fuzzy-kmeans does.
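[Editor's note: the prior-to-posterior iteration Jeff describes can be sketched in miniature. The names below echo the ClusterIterator/ClusteringPolicy classes in o.a.m.clustering, but the bodies are simplified, hypothetical stand-ins (a hard k-means-style policy over 1-d points), not the real Mahout implementations.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: each pass loads the prior clusters, classifies every input
// point against them, and emits a posterior cluster set for the next pass.
public class ClusterIterationSketch {

  // A "cluster" reduced to a 1-d centroid plus accumulation state.
  static class Cluster {
    double centroid;
    double sum;
    int count;
    Cluster(double centroid) { this.centroid = centroid; }
  }

  // Policy: pick which prior cluster a point belongs to (hard, k-means-like).
  static int closest(List<Cluster> priors, double point) {
    int best = 0;
    for (int i = 1; i < priors.size(); i++) {
      if (Math.abs(priors.get(i).centroid - point)
          < Math.abs(priors.get(best).centroid - point)) {
        best = i;
      }
    }
    return best;
  }

  // One iteration: classify every point against the priors, then emit
  // posteriors with recomputed centroids.
  static List<Cluster> iterate(List<Cluster> priors, double[] points) {
    for (double p : points) {
      Cluster c = priors.get(closest(priors, p));
      c.sum += p;
      c.count++;
    }
    List<Cluster> posteriors = new ArrayList<>();
    for (Cluster c : priors) {
      posteriors.add(new Cluster(c.count > 0 ? c.sum / c.count : c.centroid));
    }
    return posteriors;
  }

  public static void main(String[] args) {
    double[] points = {1.0, 1.2, 0.8, 9.0, 9.4, 8.6};
    // clusters-0: two arbitrary priors
    List<Cluster> clusters = new ArrayList<>();
    clusters.add(new Cluster(0.0));
    clusters.add(new Cluster(5.0));
    // clusters-1, clusters-2, ...: each pass reads the prior set and
    // writes a posterior set, the loop a driver would run per iteration.
    for (int i = 0; i < 3; i++) {
      clusters = iterate(clusters, points);
    }
    System.out.printf("%.2f %.2f%n",
        clusters.get(0).centroid, clusters.get(1).centroid);
  }
}
```

In the real code the points are Vectors, the policy is pluggable (hard for k-means, weighted for fuzzyK/Dirichlet), and the prior/posterior sets live in HDFS directories rather than in memory.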
It also produces things which *act* like a "ClusterClassifier", and while
this is unsupervised, once you extend to Labeled LDA (saving that merge
from my GitHub fork until 0.8 "new features"), it's also a supervised
classifier.

I'm not necessarily saying that LDA (and Canopy, and SVD) *must* merge to
use the same API, but if we're doing work to unify these things so they
talk the same language, seeing what the end goal is (maybe not reached in
this round of refactoring) would help inform the process of how we do this
next step.

I can state my reasons for liking simple vectors both for the "classifier"
and the "cluster": the disk format is the same as our input data, so when
people write utils to hook this up to Pig (and Scalding, Cascalog, Hive,
etc.), you don't need to write utils for handling a new data type, and even
algorithms that take this input can run over the *outputs*: e.g. you
generate a set of "clusters" in LDA which are topics - each one is a vector
over input features, so this collection of vectors can be fed, with *no
change*, into another clustering algorithm, like KMeans, to find which
topics are most like each other [maybe a contrived example, but there may
be better ones: make a tree/hierarchy based on topics as inputs instead of
documents as inputs, to see if there is a nice tree structure to your topic
model!]. When your outputs are a bunch of custom Cluster thingees, you
can't interoperate with everything else (regression, vector-based
recommenders, etc.) without more work.

  -jake

> Jeff
>
> On 2/22/12 9:55 AM, Jake Mannix wrote:
>
>> So I haven't looked super-carefully at the clustering refactoring work,
>> can someone give a little overview of what the plan is?
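[Editor's note: Jake's interoperability argument - plain vectors let one algorithm's outputs become another's inputs unchanged - can be sketched as follows. The toy cosine assignment and the tiny topic matrix are illustrative assumptions, not Mahout code.]

```java
import java.util.Arrays;

// If topics are plain vectors over the feature space, the *output* of LDA
// can be fed unchanged into any other vector-consuming algorithm.
public class VectorsAsInterchange {

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  static double cosine(double[] a, double[] b) {
    return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)) + 1e-12);
  }

  // Hard-assign each input vector to the nearer of two seed vectors.
  // The caller doesn't care whether the inputs are documents or topics.
  static int[] assign(double[][] inputs, double[] seedA, double[] seedB) {
    int[] labels = new int[inputs.length];
    for (int i = 0; i < inputs.length; i++) {
      labels[i] = cosine(inputs[i], seedA) >= cosine(inputs[i], seedB) ? 0 : 1;
    }
    return labels;
  }

  public static void main(String[] args) {
    // Pretend these are LDA topics: p(term | topic) rows over a 4-term vocab.
    double[][] topics = {
      {0.6, 0.3, 0.05, 0.05},    // one topic
      {0.55, 0.35, 0.05, 0.05},  // a near-duplicate of it
      {0.05, 0.05, 0.5, 0.4},    // a clearly different topic
    };
    int[] labels = assign(topics, topics[0], topics[2]);
    // Topics 0 and 1 land together, topic 2 alone: similar topics found by
    // running a clustering step over clustering output, with no conversion.
    System.out.println(Arrays.toString(labels)); // [0, 0, 1]
  }
}
```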
>>
>> The NewLDA stuff is technically in "clustering" and generally works by
>> taking in SeqFile<IW,VW> documents as the training corpus, and spits out
>> two things: a SeqFile<IW,VW> of a "model" (keyed on topicId, one vector
>> per topic) and a SeqFile<IW,VW> of "classifications" (keyed on docId, one
>> vector over the topic space for projection onto each topic dimension).
>>
>> This is similar to how SVD clustering/decomposition works, but with
>> L1-normed outputs instead of L2.
>>
>> But this seems very different from all of the structures in the rest of
>> clustering.
>>
>> -jake
>>
>> On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman
>> <[email protected]> wrote:
>>
>>> Hi Saikat,
>>>
>>> I agree with Paritosh that a great place to begin would be to write
>>> some unit tests. This will familiarize you with the code base and help
>>> us a lot with our 0.7 housekeeping release. The new clustering
>>> classification components are going to unify many - but not all - of
>>> the existing clustering algorithms to reduce their complexity by
>>> factoring out duplication and streamlining their integration into
>>> semi-supervised classification engines.
>>>
>>> Please feel free to post any questions you may have in reading through
>>> this code. This is a major refactoring effort and we will need all the
>>> help we can get.
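[Editor's note: the L1-vs-L2 distinction above in miniature - LDA-style outputs are normalized so entries sum to 1 (a probability distribution over topics), while SVD-style outputs are normalized to unit Euclidean length. Plain arrays here, not Mahout's Vector API.]

```java
import java.util.Arrays;

// Normalizing the same raw vector two ways: L1 (LDA-style) vs L2 (SVD-style).
public class NormSketch {

  // Scale so the absolute entries sum to 1 (a distribution).
  static double[] l1Normalize(double[] v) {
    double sum = 0;
    for (double x : v) sum += Math.abs(x);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) out[i] = v[i] / sum;
    return out;
  }

  // Scale so the Euclidean length is 1 (a direction).
  static double[] l2Normalize(double[] v) {
    double norm = 0;
    for (double x : v) norm += x * x;
    norm = Math.sqrt(norm);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
    return out;
  }

  public static void main(String[] args) {
    double[] raw = {3.0, 4.0};
    System.out.println(Arrays.toString(l1Normalize(raw))); // entries sum to 1
    System.out.println(Arrays.toString(l2Normalize(raw))); // unit length
  }
}
```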
>>> Thanks for the offer,
>>>
>>> Jeff
>>>
>>> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
>>>
>>>> Hi Paritosh,
>>>> Yes, creating the test case would be a great first start; however, are
>>>> there other tasks you guys need help with before the test creation? I
>>>> will sync trunk and start reading through the code in the meantime.
>>>> Regards
>>>>
>>>>> Date: Wed, 22 Feb 2012 10:57:51 +0530
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>> Subject: Re: Helping out with the .7 release
>>>>>
>>>>> We are creating clustering-as-classification components which will
>>>>> help in moving clustering out. Once the component is ready, then the
>>>>> clustering algorithms would need refactoring.
>>>>> The clustering-as-classification component and the outlier removal
>>>>> component have been created.
>>>>>
>>>>> Most of it is committed, and the rest is available as a patch. See
>>>>> https://issues.apache.org/jira/browse/MAHOUT-929
>>>>>
>>>>> If you apply the latest patch available on MAHOUT-929 you can see all
>>>>> that is available now.
>>>>>
>>>>> If you want, you can help with the test case of
>>>>> ClusterClassificationMapper available in the patch.
>>>>>
>>>>> On 22-02-2012 10:27, Saikat Kanjilal wrote:
>>>>>
>>>>>> Hi Guys,
>>>>>> I was interested in helping out with the clustering component of
>>>>>> Mahout. I looked through the JIRA items below and was wondering if
>>>>>> there is a specific one that would be good to start with:
>>>>>>
>>>>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
>>>>>>
>>>>>> I initially was thinking to work on MAHOUT-930 or MAHOUT-931 but
>>>>>> could work on others if needed.
>>>>>> Best Regards
