Hi everyone,

I have submitted the proposal [1]. Thanks a lot, everyone, for your valuable insights. I would greatly appreciate it if you could take a few minutes to review it.
[1] https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/chalitha_perera/5629499534213120

Thanks.
Chalitha

On Wed, Mar 19, 2014 at 1:22 PM, Sebastian Schelter <[email protected]> wrote:

> It's not about directly porting algorithms to Spark, it's about porting
> them to a DSL that executes on top of Spark. This page has information
> about it:
>
> https://mahout.apache.org/users/sparkbindings/home.html
>
> --sebastian
>
> On 03/19/2014 08:43 AM, chalitha udara Perera wrote:
>
>> Thanks a lot, everyone, for your valuable insights. Since the main focus
>> is now on porting to Spark, I would be really happy to get involved with
>> it. Can you give me more information on the current progress with
>> porting, especially regarding the clustering component?
>>
>> Regards,
>> Chalitha
>>
>> On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi <[email protected]> wrote:
>>
>>> On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov
>>> <[email protected]> wrote:
>>>
>>> On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera
>>> <[email protected]> wrote:
>>>
>>>> Hi Dmitriy,
>>>>
>>>> I agree with you that I need to be more specific on this matter. Here
>>>> I was referring to some suggestions given by Suneel on Mahout 1.0
>>>> goals [1], b and c.
>>>
>>> He mainly speaks of test coverage there and REST exposition. What you
>>> are saying is a bit more ambitious IMO.
>>>
>>> This was long before the discussion of H2O and Spark had come up. In a
>>> later email, I had also mentioned uniform interfaces for API and
>>> porting stuff to Spark.
>>>
>>>> For example, this is one thing I have experienced while using Mahout
>>>> clustering. I have used both simple kmeans and spectral kmeans. For
>>>> simple kmeans the input is the sequence file containing the tfidf
>>>> vectors of the documents, while for spectral kmeans it is a csv file
>>>> defining the similarity matrix. It would have been much easier for
>>>> users if spectral kmeans also took the tfidf vectors and created the
>>>> similarity matrix internally. I think that would improve the
>>>> usability.
>>>
>>> I don't think clustering is tf-idf specific. I think this is a chance
>>> for proper componentization of concerns here.
>>>
>>> Agree with Dmitriy here.
>>
>> Totally agree. I was just trying to give an example of uniformity.
>>
>>>> And most of these algorithms are designed to run via the command
>>>> line. I know that currently a lot of programmers just use the
>>>> run(String []) method for programming. I am not saying it is
>>>> impossible to use Mahout clustering algorithms as required, but it
>>>> takes some effort; most of the time you need to dive into the code
>>>> internals to use it properly, and most people are not going to do
>>>> that. Please provide your valuable insight on this.
>>>>
>>>> I am also really interested in the new direction Mahout is heading
>>>> with Spark, given that interest in Spark will only grow in the near
>>>> future. If you think implementing some of the clustering algorithms,
>>>> for example simple kmeans, to support Spark is more important for the
>>>> next release, I would be happy to work on that.
>>>
>>> I would be happy to see you give a try there, too.
>>>
>>>> Regards,
>>>> Chalitha
>>>>
>>>> [1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%3C1393554632.3930.YahooMailNeo@web160202.mail.bf1.yahoo.com%3E
>>>>
>>>> On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]>
>>>> wrote:
>>>>
>>>>> I think you need to be a little bit more specific as to what you are
>>>>> proposing exactly. I think "uniform clustering api" needs a bit of
>>>>> elaboration. I, generally, cannot say that I experienced any pain
>>>>> calling out clustering algorithms say in R as a well-documented
>>>>> function. In Mahout just doing the same was primarily a pain; but
>>>>> assuming one can call it with ease and even interactively, I can't
>>>>> say I experienced any major inconvenience with just doing this.
>>>>>
>>>>> I guess one can see that one can abstract away notions of clusters
>>>>> and clustering output, but I don't have enough experience to tell
>>>>> whether it is a good idea to cover _any_ possible clustering
>>>>> methodology.
>>>>>
>>>>> On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I greatly appreciate your interest in this issue. I have gone
>>>>>> through the document ScalaSparkBindings [1]. In this project my
>>>>>> initial idea was to provide a high-level API for end-user
>>>>>> programmers so that they have the flexibility of plugging in
>>>>>> different types of algorithms without worrying about the underlying
>>>>>> details of different types of inputs or outputs. Also, I consider
>>>>>> providing proper test coverage for all clustering algorithms a must
>>>>>> for the 1.0 release.
>>>>>>
>>>>>> I would like to get your opinion regarding this, and a little more
>>>>>> detail on the current requirements for clustering would help me to
>>>>>> improve the proposal.
>>>>>>
>>>>>> Thanks,
>>>>>> Chalitha
>>>>>>
>>>>>> On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, there's interest.
>>>>>>> Note that we are trying to unify linear algebra primitives and
>>>>>>> optimization on Spark as well. All new linear algebra and
>>>>>>> interaction with spark context should probably go thru this layer.
>>>>>>> This is an ongoing thing but some stuff is working [1].
>>>>>>>
>>>>>>> [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
>>>>>>>
>>>>>>> On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Going through the mail thread "Mahout 1.0 goals", I found that
>>>>>>>> the main focus of Mahout is now on code refactoring and
>>>>>>>> integration with Spark rather than implementing new algorithms.
>>>>>>>> Recently I have used Mahout for implementing a document
>>>>>>>> clustering module for a Content Management System.
>>>>>>>>
>>>>>>>> To be honest, we had some problems with the lack of uniformity
>>>>>>>> among different clustering algorithms. For example, simple Kmeans
>>>>>>>> takes as input the sequence file with document TF-IDF vectors,
>>>>>>>> while Spectral Kmeans takes the csv file that defines the
>>>>>>>> similarity matrix.
>>>>>>>>
>>>>>>>> I think if we can provide a uniform clustering API as mentioned
>>>>>>>> in the 1.0 goals, it would be very useful for end-user
>>>>>>>> developers.
>>>>>>>>
>>>>>>>> I would like to proceed with this idea as my GSOC 2014 project.
>>>>>>>> Please let me know if you are interested in this project.
>>>>>>>> --
>>>>>>>> J.M Chalitha Udara Perera
>>>>>>>>
>>>>>>>> *Department of Computer Science and Engineering,*
>>>>>>>> *University of Moratuwa,*
>>>>>>>> *Sri Lanka*
>>>>>>
>>>>>> --
>>>>>> J.M Chalitha Udara Perera
>>>>>>
>>>>>> *Department of Computer Science and Engineering,*
>>>>>> *University of Moratuwa,*
>>>>>> *Sri Lanka*
>>>>
>>>> --
>>>> J.M Chalitha Udara Perera
>>>>
>>>> *Department of Computer Science and Engineering,*
>>>> *University of Moratuwa,*
>>>> *Sri Lanka*

--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
