Many if not most Mahout committers and contributors will be new to Scala and Spark, certainly to the Mahout Scala DSL.
I'm a complete noob to Spark and Scala, so I dove into Scala as a first step. It is deceptively simple, but you run into odd limitations and special cases quickly. Anyway, a good starting point seems to be Scala, especially its functional programming features. Those, plus Spark's architecture, the Mahout Scala DSL, and (especially for the scientist types out there) the Mahout shell, will make writing new code a couple of orders of magnitude easier than Java/Hadoop/MapReduce. There is very strong support for Scala on Stack Overflow. You will see my simpleton questions there, and I encourage everyone to take advantage, because the volume of stuff to Google is much smaller than for Java (obviously?).

On May 30, 2014, at 3:12 PM, Andrew Palumbo <[email protected]> wrote:

>> IMO we should wait on core DSL functionality if it's not there, but if you are doing something that is external then full-blown dataframes may not block you or even help you. DRMs are pretty mature. You'll have to decide that based on your own needs.

Also wanted to say I agree completely; not trying to jump the gun on this.

From: [email protected]
To: [email protected]
Subject: RE: Sketching out scala traits and 1.0 API
Date: Fri, 30 May 2014 18:04:33 -0400

Just jumping in here real quick... not trying to derail the conversation. I have a lot of catching up to do on the status of the Dataframe implementation, the DSL, and Pat's ItemSimilarity implementation so that I can better understand what's going on. I'm going to try to take a look at this stuff over the weekend.

I think I see how my thinking on this has been wrong in terms of "translating a Dataframe to a DRM". Also, I think that NB was a bad example because it's kind of a special-case classifier.

I guess, from my end, what I'm wondering in terms of laying out traits for classifiers is: are we going to try to provide a kind of Weka- or R-like pluggable interface? And if so, how would that look? I'm speaking specifically about batch-trained, supervised classification algorithms at this point. (Which I'm not sure anybody is interested in going forward, but I am.)

For example, I'm doing some work right now that involves comparing results from some off-the-shelf algorithms, working in R with a small dense dataset; nothing really novel. Once my dataframe is all set up, switching classifiers looks basically like this:

# Train a random forest
res.rf <- randomForest(
    formula=formula,
    data=d_train,
    nodesize=1,
    classwt=CLASSWT,
    sampsize=length(d_train[,1]),
    proximity=F,
    na.action=na.roughfix,
    ntree=1000)

# Train an rpart tree
res.rpart <- rpart(
    formula=formula,
    data=d_train,
    method="class",
    control=rpart.control(minsplit=2, cp=0))

I know that this is not that useful to the typical Mahout user right now. But with a shell/script, a linear algebra DSL with a distributed back end, and a bunch of algorithms in the library, I think that this will be useful, or will draw in new users. The reason I brought up the full NB pipeline is to ensure that if we are to lay out traits for new (classification) algorithms, it is done in the most robust way possible, and in a way that eases development from prototyping in the shell to deployment.

> Date: Fri, 30 May 2014 14:54:20 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: [email protected]
> To: [email protected]
>
> Frankly, except for columnar organization and some math summarization functionality, I don't see much difference between these data frames and e.g. scalding tuple-based manipulations.
> On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I am not sure I understand the question. It would be possible to save the results of rowSimilarityJob as a data frame. No, data frames do not support quick bidirectional indexing on demand, in the sense of bringing a full column or row to a front-end process very quickly (e.g. row id -> row vector, or columnName -> column). They will support iterative filtering and mutating just like in the dplyr package of R. (I hope.)
>>
>> In general, I'd only say that data frames are called data frames because the scope of functionality and intent is that of R data frames (there's no other source for the term "data frame"; i.e. MATLAB doesn't have those, I think), minus quick random individual cell access, which is replaced by dplyr-style FP computations.
>>
>> So really I'd say one needs to look at dplyr and R to understand the scope of this, at this point, in my head.
>>
>> Filtering over rows (including their labels) is implied by dplyr and R. The column selection pattern is a bit different, via %.% select() and %.% mutate() (it assumes data frames are like tables: few attributes but a lot of rows). Data frames therefore do not respond well to linalg operations, which often require a lot of orientation change.
>>
>> On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
>>
>>>>> Something that concerns me about dataframes is whether they will be useful for batch operations given D's avowed lack of interest :-)
>>>>
>>>> Pat, please don't dump everything in one pile :)
>>>
>>> Only kidding --> :-)
>>>
>>>> Every other stage here (up to training) is usually either batching or streaming. Data frames are to be used primarily in featurization and vectorization, which is either streaming (in the Spark/Storm sense) or a batch. These stages can benefit from the fast columnar organization of data frames, allowing fast multiple passes. I can imagine some methodologies in training _may_ work better off data frames too, rather than off the matrices.
>>>>
>>>> Hope that clarifies.
>>>
>>> Well, that brings us to the real question: if we need to serialize a DRM with restored user-specified row and column IDs, do you expect some future dataframe will support this well? I'd guess this would be some kind of .map over rows. Like this, only getting ID values from the dataframe:
>>>
>>> matrix.rdd.map({ case (rowID, itemVector) =>
>>>   var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
>>>   for (item <- itemVector.nonZeroes()) {
>>>     line += columnIDDictionary.inverse.get(item.index) + outDelim2 + item.get + outDelim3
>>>   }
>>>   line.dropRight(1)
>>> })
>>> .saveAsTextFile(dest)
>>>
>>> A similar question applies to deserializing or building a dataframe. I ask because IndexedDataset uses Guava HashBiMaps in memory on all cluster machines. Seems like a potential scaling issue, but then a distributed HashMap is called a database.
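To make Andrew's question above a bit more concrete (what might a Weka- or R-like pluggable interface for batch-trained classifiers look like in Scala?), here is a minimal sketch. The trait and method names (ClassifierFitter, ClassifierModel, train, classify) and the type parameters are hypothetical placeholders, not an agreed-upon Mahout API; M stands in for whatever distributed structure (a DRM or a data frame) ends up carrying the data.

// Hypothetical sketch only: these names are placeholders, not an existing Mahout API.
// M is whatever type carries the observations (e.g. a DRM or a data frame);
// P is a bag of algorithm-specific parameters.

// A trained model that can assign labels/scores to new observations.
trait ClassifierModel[M] {
  def classify(observations: M): M
}

// A batch trainer: labeled observations plus algorithm-specific parameters in,
// a trained model out. Swapping algorithms then means swapping the fitter,
// much like switching randomForest() for rpart() in the R example above.
trait ClassifierFitter[M, P] {
  def train(labeledObservations: M, params: P): ClassifierModel[M]
}

Concrete algorithms (a random forest, a decision tree, NB, and so on) would each implement ClassifierFitter, so a shell script could switch classifiers by swapping a single object, which is the R-style ergonomics described in the thread.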
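In the same spirit, here is a rough sketch of the dplyr-style verbs Dmitriy describes (row filtering, select, mutate), purely to make the intent concrete; again, these traits and method names are hypothetical and do not correspond to anything that exists in Mahout.

// Hypothetical sketch of dplyr-like verbs on a data frame; not an existing API.
trait Row {
  def apply(column: String): Any   // cell access within a single row
  def label: String                // the row label / key
}

trait DataFrame {
  def filter(p: Row => Boolean): DataFrame                  // row-wise filtering, like dplyr's filter()
  def select(columns: String*): DataFrame                   // keep a subset of columns, like select()
  def mutate(newColumn: String)(f: Row => Any): DataFrame   // add a derived column, like mutate()
}

As the discussion above notes, quick random cell access and frequent orientation changes would be out of scope for such an API; linear algebra stays on the DRM side.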
