Re: [GSOC 2014] Uniform API for Mahout Clustering

chalitha udara Perera Fri, 21 Mar 2014 04:59:27 -0700

Hi everyone,

I have submitted the proposal [1]. Thanks a lot, everyone, for the valuable
insights. I would greatly appreciate it if you could take a few minutes to
review it.


[1]
https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/chalitha_perera/5629499534213120

Thanks.
Chalitha




On Wed, Mar 19, 2014 at 1:22 PM, Sebastian Schelter <[email protected]> wrote:

> It's not about directly porting algorithms to Spark; it's about porting
> them to a DSL that executes on top of Spark. This page has information
> about it:
>
> https://mahout.apache.org/users/sparkbindings/home.html
>
> --sebastian
>
>
> On 03/19/2014 08:43 AM, chalitha udara Perera wrote:
>
>> Thanks a lot, everyone, for the valuable insights. Since the main focus is
>> now on porting to Spark, I would be really happy to get involved with it.
>> Can you give me more information on the current progress of the porting,
>> especially regarding the clustering component?
>>
>> Regards,
>> Chalitha
>>
>>
>> On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi <[email protected]>
>> wrote:
>>
>>
>>>
>>>
>>>
>>>
>>> On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov <
>>> [email protected]>
>>> wrote:
>>>
>>> On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera <
>>> [email protected]> wrote:
>>>
>>>> Hi Dmitriy,
>>>>
>>>> I agree with you that I need to be more specific on this matter. Here I
>>>> was referring to some suggestions given by Suneel on Mahout 1.0 goals
>>>> [1], b and c.
>>>>
>>> He mainly speaks of test coverage there and REST exposition. What you
>>> are saying is a bit more ambitious IMO.
>>>
>>> This was long before the discussion of H2O and Spark had come up. In a
>>> later email, I had also mentioned uniform interfaces for API and porting
>>> stuff to Spark.
>>>
>>>>
>>>> For example, this is one thing I have experienced while using Mahout
>>>> clustering. I have used both simple k-means and spectral k-means. For
>>>> simple k-means the input is a sequence file containing the TF-IDF vectors
>>>> of the documents, while for spectral k-means it is a CSV file defining
>>>> the similarity matrix. It would be much easier for users if spectral
>>>> k-means also took the TF-IDF vectors and created the similarity matrix
>>>> internally. I think that would improve the usability.
>>>>
>>> I don't think clustering is tf-idf specific. I think this is a chance for
>>> proper componentization of concerns here.
>>>
>>> Agree with Dmitriy here.
>>>
>>>
>> Totally Agree. I was just trying to give an example of uniformity.
>>
>>
>>>
>>>> And most of these algorithms are designed to run via the command line. I
>>>> know that currently a lot of programmers just use the run(String[])
>>>> method when calling them from code. I am not saying it is impossible to
>>>> use the Mahout clustering algorithms as required, but it takes some
>>>> effort: most of the time you need to dive into the code internals to use
>>>> them properly, and most people are not going to do that. Please provide
>>>> your valuable insight on this.
>>>>
>>>> I am also really interested in the new direction Mahout is heading with
>>>> Spark, given that interest in Spark will only grow in the near future.
>>>> If you think implementing some of the clustering algorithms, for example
>>>> simple k-means, to support Spark is more important for the next release,
>>>> I would be happy to work on that.
>>>>
>>>>
>>> I would be happy to see you give a try there, too.
>>>
>>>
>>
>>>> Regards,
>>>> Chalitha
>>>>
>>>> [1]
>>>> http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%3C1393554632.3930.YahooMailNeo@web160202.mail.bf1.yahoo.com%3E
>>>>
>>>>
>>>>
>>>> On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]>
>>>> wrote:
>>>>
>>>>> I think you need to be a little bit more specific as to what you are
>>>>> proposing exactly. I think "uniform clustering api" needs a bit of
>>>>> elaboration. I, generally, cannot say that I experienced any pain
>>>>> calling out clustering algorithms, say in R, as a well-documented
>>>>> function. In Mahout just doing the same was primarily a pain; but
>>>>> assuming one can call it with ease and even interactively, I can't say
>>>>> I experienced any major inconvenience with just doing this.
>>>>>
>>>>> I guess one can see that one can abstract away notions of clusters and
>>>>> clustering output, but I don't have enough experience to tell whether
>>>>> it is a good idea to cover _any_ possible clustering methodology.
>>>>>
>>>>>
>>>>> On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I greatly appreciate your interest in this issue. I have gone through
>>>>>> the document ScalaSparkBindings [1]. In this project my initial idea
>>>>>> was to provide a high-level API for end-user programmers so that they
>>>>>> have the flexibility of plugging in different types of algorithms
>>>>>> without worrying about the underlying details of different types of
>>>>>> inputs or outputs. I also consider providing proper test coverage for
>>>>>> all clustering algorithms a must for the 1.0 release.
>>>>>>
>>>>>> I would like to get your opinion on this; a little more detail on the
>>>>>> current requirements for clustering would help me improve the
>>>>>> proposal.
>>>>>>
>>>>>> Thanks,
>>>>>> Chalitha
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, there's interest.
>>>>>>> Note that we are trying to unify linear algebra primitives and
>>>>>>> optimization on Spark as well. All new linear algebra and interaction
>>>>>>> with the Spark context should probably go through this layer. This is
>>>>>>> an ongoing thing, but some stuff is working [1].
>>>>>>>
>>>>>>> [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Going through the mail thread "Mahout 1.0 goals", I found that the
>>>>>>>> main focus of Mahout is now on code refactoring and integration
>>>>>>>> with Spark rather than implementing new algorithms. Recently I have
>>>>>>>> used Mahout to implement a document clustering module for a Content
>>>>>>>> Management System.
>>>>>>>>
>>>>>>>> To be honest, we had some problems with the lack of uniformity among
>>>>>>>> different clustering algorithms. For example, simple k-means takes
>>>>>>>> its input as a sequence file of document TF-IDF vectors, while
>>>>>>>> spectral k-means takes a CSV file that defines the similarity
>>>>>>>> matrix.
>>>>>>>>
>>>>>>>> I think if we can provide a uniform clustering API as mentioned in
>>>>>>>> the 1.0 goals, it would be very useful for end-user developers.
>>>>>>>>
>>>>>>>> I would like to proceed with this idea as my GSOC 2014 project.
>>>>>>>> Please let me know if you are interested in this project.
>>>>>>>> --
>>>>>>>> J.M Chalitha Udara Perera
>>>>>>>>
>>>>>>>> *Department of Computer Science and Engineering,*
>>>>>>>> *University of Moratuwa,*
>>>>>>>> *Sri Lanka*
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> J.M Chalitha Udara Perera
>>>>>>
>>>>>> *Department of Computer Science and Engineering,*
>>>>>> *University of Moratuwa,*
>>>>>> *Sri Lanka*
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> J.M Chalitha Udara Perera
>>>>
>>>> *Department of Computer Science and Engineering,*
>>>> *University of Moratuwa,*
>>>> *Sri Lanka*
>>>>
>>>>
>>>
>>
>>
>>
>


-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
