Thanks a lot, everyone, for the valuable insights. Since the main focus is now on porting to Spark, I would be really happy to get involved with it. Can you give me more information on the current progress of the porting, especially regarding the clustering component?
Regards,
Chalitha

On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi <[email protected]> wrote:

> On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera
> <[email protected]> wrote:
>
> > Hi Dmitriy,
> >
> > I agree with you that I need to be more specific on this matter. Here I
> > was referring to some suggestions given by Suneel on Mahout 1.0 goals
> > [1], b and c.
>
> He mainly speaks of test coverage there and REST exposition. What you are
> saying is a bit more ambitious IMO.
>
> This was long before the discussion of H2O and Spark had come up. In a
> later email, I had also mentioned uniform interfaces for API and porting
> stuff to Spark.
>
> > For example, this is one thing I have experienced while using Mahout
> > clustering. I have used both simple kmeans and spectral kmeans. For
> > simple kmeans the input is the sequence file containing the tfidf
> > vectors of the documents, while for spectral kmeans it is a csv file
> > defining the similarity matrix. It would have been much easier for users
> > if spectral kmeans also took the tfidf vectors and created the
> > similarity matrix internally. I think that would improve the usability.
>
> I don't think clustering is tf-idf specific. I think this is a chance for
> proper componentization of concerns here.
>
> Agree with Dmitriy here.

Totally agree. I was just trying to give an example of uniformity.

> > And most of these algorithms are designed to run via the command line. I
> > know that currently a lot of programmers just use the run(String[])
> > method for programming. I am not saying it is impossible to use Mahout
> > clustering algorithms as required, but it takes some effort; most of the
> > time you need to dive into the code internals to use it properly, and
> > most people are not going to do that. Please provide your valuable
> > insight on this.
> >
> > I am also really interested in the new direction Mahout is heading with
> > Spark, given that interest in Spark will only grow in the near future.
> > If you think implementing some of the clustering algorithms, for example
> > simple kmeans, to support Spark is more important for the next release,
> > I would be happy to work on that.
>
> I would be happy to see you give it a try there, too.
>
> > Regards,
> > Chalitha
> >
> > [1]
> > http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%[email protected]%3E
> >
> > On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > I think you need to be a little bit more specific as to what you are
> > > proposing exactly. I think "uniform clustering api" needs a bit of
> > > elaboration. I, generally, cannot say that I experienced any pain
> > > calling out clustering algorithms, say in R, as a well-documented
> > > function. In Mahout just doing the same was primarily a pain; but
> > > assuming one can call it with ease and even interactively, I can't say
> > > I experienced any major inconvenience with just doing this.
> > >
> > > I guess one can see that one can abstract away notions of clusters and
> > > clustering output, but I don't have enough experience to tell whether
> > > it is a good idea to cover _any_ possible clustering methodology.
> > >
> > > On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera
> > > <[email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Greatly appreciate your interest in this issue. I have gone through
> > > > the document ScalaSparkBindings [1]. In this project my initial idea
> > > > was to provide a high-level API for end-user programmers so that
> > > > they have the flexibility of plugging in different types of
> > > > algorithms without worrying about the underlying details of
> > > > different types of inputs or outputs. Also, I consider providing
> > > > proper test coverage for all clustering algorithms a must for the
> > > > 1.0 release.
> > > >
> > > > I would like to get your opinion regarding this, and a little more
> > > > detail on the current requirements for clustering would help me to
> > > > improve the proposal.
> > > >
> > > > Thanks,
> > > > Chalitha
> > > >
> > > > On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov
> > > > <[email protected]> wrote:
> > > >
> > > > > Yes, there's interest.
> > > > > Note that we are trying to unify linear algebra primitives and
> > > > > optimization on Spark as well. All new linear algebra and
> > > > > interaction with the Spark context should probably go through this
> > > > > layer. This is an ongoing thing, but some stuff is working [1].
> > > > >
> > > > > [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
> > > > >
> > > > > On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Going through the mail thread "Mahout 1.0 goals", I found that
> > > > > > the main focus of Mahout is now on code refactoring and
> > > > > > integration with Spark rather than implementing new algorithms.
> > > > > > Recently I have used Mahout for implementing a document
> > > > > > clustering module for a Content Management System.
> > > > > >
> > > > > > To be honest, we had some problems with the lack of uniformity
> > > > > > among different clustering algorithms. For example, simple
> > > > > > Kmeans takes as input the sequence file with document TF-IDF
> > > > > > vectors, while Spectral Kmeans takes the csv file that defines
> > > > > > the similarity matrix.
> > > > > >
> > > > > > I think if we can provide a uniform clustering API as mentioned
> > > > > > in the 1.0 goals, it would be very useful for end-user
> > > > > > developers.
> > > > > >
> > > > > > I would like to proceed with this idea as my GSoC 2014 project.
> > > > > > Please let me know if you are interested in this project.
> > > > > >
> > > > > > --
> > > > > > J.M Chalitha Udara Perera
> > > > > >
> > > > > > *Department of Computer Science and Engineering,*
> > > > > > *University of Moratuwa,*
> > > > > > *Sri Lanka*
> > > >
> > > > --
> > > > J.M Chalitha Udara Perera
> > > >
> > > > *Department of Computer Science and Engineering,*
> > > > *University of Moratuwa,*
> > > > *Sri Lanka*
> >
> > --
> > J.M Chalitha Udara Perera
> >
> > *Department of Computer Science and Engineering,*
> > *University of Moratuwa,*
> > *Sri Lanka*

--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
