Many, if not most, Mahout committers and contributors will be new to Scala and
Spark, and certainly to the Mahout Scala DSL.

I’m a complete noob to Spark and Scala, so I dove into Scala as a first step. It
is deceptively simple, but you run into odd limitations and special cases
quickly. Anyway, a good starting point seems to be Scala, especially its
functional programming features. Those, plus Spark’s architecture, the Mahout
Scala DSL, and (especially for the scientist types out there) the Mahout Shell,
will make writing new code a couple of orders of magnitude easier than
Java/Hadoop/MapReduce.

There is very strong support for Scala on Stack Overflow. You will see my
simpleton questions there, and I encourage everyone to take advantage, because
the volume of stuff to Google is much smaller than for Java (obviously?).

On May 30, 2014, at 3:12 PM, Andrew Palumbo <[email protected]> wrote:

>> IMO we should wait on core DSL functionality if it’s
>> not there but if you are doing something that is external
>> then full blown dataframes may not block you or even help you.
>> Drms are pretty mature. You’ll have to decide that based on
>> your own needs.
>
> Also wanted to say I agree completely - not trying to jump the gun on this.

From: [email protected]
To: [email protected]
Subject: RE: Sketching out scala traits and 1.0 API
Date: Fri, 30 May 2014 18:04:33 -0400




Just jumping in here real quick... not trying to derail the conversation...

I have a lot of catching up to do on the status of the Dataframe
implementation, the DSL, and Pat's ItemSimilarity implementation so that I can
better understand what's going on. I'm going to try to take a look at this
stuff over the weekend.

I think I see how my thinking about this has been wrong in terms of "Translating a
Dataframe to a DRM".  Also, I think that NB was a bad example because it's kind
of a special-case classifier.

I guess from my end, what I'm wondering about in terms of laying out traits for
classifiers is: are we going to try to provide a kind of Weka- or R-like
pluggable interface? And if so, how would that look?  I guess I'm speaking
specifically about batch-trained, supervised classification algorithms at this
point (which I'm not sure anybody is interested in going forward, but I am).

For example, I'm doing some work right now that involves comparing results from
some off-the-shelf algorithms. I'm working in R with a small, dense dataset -
nothing really novel.  Once my dataframe is all set up, switching classifiers
looks basically like this:

# Train a random forest
library(randomForest)
res.rf <- randomForest(formula=formula, data=d_train, nodesize=1,
                       classwt=CLASSWT, sampsize=length(d_train[,1]),
                       proximity=F, na.action=na.roughfix, ntree=1000)

# Train an rpart tree
library(rpart)
res.rpart <- rpart(formula=formula, data=d_train, method="class",
                   control=rpart.control(minsplit=2, cp=0))
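
Ideally, the same experiment against the Mahout shell could one day read about
the same way. This is purely a sketch: the trainer classes, drmX/drmY, and the
train() signature below are all made up to illustrate the idea, none of them
exist in Mahout today.

// Hypothetical API, only to illustrate "swap one line to swap the algorithm";
// none of these names are real Mahout/DSL code.
val rfModel   = new RandomForestTrainer(numTrees = 1000, nodeSize = 1).train(drmX, drmY)
val cartModel = new DecisionTreeTrainer(minSplit = 2).train(drmX, drmY)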

I know that this kind of workflow is not that useful to the typical Mahout user
right now.  But with a shell/script, a Linear Algebra DSL with a distributed
back end, and a bunch of algorithms in the library, I think that it will become
useful, or at least will draw in new users.

The reason I brought up the full NB pipeline is to ensure that if we are to lay
out traits for new (classification) algorithms, it is done in the most robust
way possible, and in a way that eases development from prototyping in the shell
to deployment.
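
To make the trait question concrete, here is a very rough sketch of the kind of
layout I have in mind.  Treat it as a strawman: only DrmLike is assumed to come
from the DSL (and its package path may differ), all of the other names below
are hypothetical.

import org.apache.mahout.math.drm.DrmLike  // assumed DSL type; package may differ

// A trained model: all it knows how to do is score new data.
trait ClassifierModel {
  def classify(testData: DrmLike[Int]): DrmLike[Int]
}

// A batch-trained, supervised learner: fit on features + labels, get a model back.
trait BatchClassifierTrainer {
  def train(features: DrmLike[Int], labels: DrmLike[Int]): ClassifierModel
}

// Swapping algorithms would then just mean swapping the trainer, e.g.
//   val model = new HypotheticalNBTrainer(alpha = 1.0).train(drmX, drmY)
//   val preds = model.classify(drmTest)
class HypotheticalNBTrainer(alpha: Double = 1.0) extends BatchClassifierTrainer {
  def train(features: DrmLike[Int], labels: DrmLike[Int]): ClassifierModel = ???
}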





> Date: Fri, 30 May 2014 14:54:20 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: [email protected]
> To: [email protected]
> 
> Frankly, except for columnar organization and some math summarization
> functionality, I don't see much difference between these data frames and
> e.g. Scalding tuple-based manipulations.
> 
> 
> On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote:
> 
>> I am not sure I understand the question. It would be possible to save results
>> of rowSimilarityJob as a data frame. No, data frames do not support quick
>> bidirectional indexing on demand, in the sense of bringing a full column or
>> row to a front-end process very quickly (e.g. row id -> row vector, or
>> columnName -> column). They will support iterative filtering and mutating,
>> just like in the dplyr package of R (I hope).
>> 
>> In general, I'd only say that data frames are called data frames because
>> the scope of functionality and intent is that of R data frames (there's no
>> other source for the term "data frame"; i.e. MATLAB doesn't have those, I
>> think), minus quick random individual-cell access, which is replaced by
>> dplyr-style FP computations.
>> 
>> So really I'd say one needs to look at dplyr and R to understand the scope
>> of this as it currently stands in my head.
>> 
>> Filtering over rows (including their labels) is implied by dplyr and R.
>> The column selection pattern is a bit different, via %.% select() and %.%
>> mutate() (it assumes data frames are like tables: few attributes but a lot of
>> rows). Data frames therefore do not respond well to linalg operations
>> that often require a lot of orientation change.
>> 
>> 
>> 
>> On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
>> 
>>> 
>>>>> Something that concerns me about dataframes is whether they will be
>>> useful
>>>>> for batch operations given D’s avowed lack of interest :-)
>>>>> 
>>>> 
>>>> Pat, please don't dump everything in one  pile :)
>>>> 
>>> 
>>> Only kidding ——> :-)
>>> 
>>>> 
>>>> Every other stage here (up to training) are usually either batching or
>>>> streaming. Data frames are to be used primarily in featurization and
>>>> vectorization, which is  either streaming (in Spark/Storm sense) or a
>>>> batch. These stages can benefit from fast columnar organization of data
>>>> frames allowing fast multiple passes. I can imagine some methodologies
>>> in
>>>> training _may_ work better off data frames too, rather than off the
>>>> matrices.
>>>> 
>>>> hope that clarifies.
>>>> 
>>> 
>>> Well, that brings us to the real question: if we need to serialize a DRM
>>> with restored user-specified row and column IDs, do you expect some future
>>> dataframe will support this well? I’d guess this would be some kind of .map
>>> over rows. Like this, only getting ID values from the dataframe:
>>> 
>>>      matrix.rdd.map({ case (rowID, itemVector) =>
>>>        var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
>>>        for (item <- itemVector.nonZeroes()) {
>>>          line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
>>>            item.get + outDelim3
>>>        }
>>>        line.dropRight(1)
>>>      })
>>>        .saveAsTextFile(dest)
>>> 
>>> A similar question applies to deserializing or building a dataframe. I
>>> ask because IndexedDataset uses Guava HashBiMaps in memory on all
>>> cluster machines. Seems like a potential scaling issue, but then a
>>> distributed HashMap is called a database.
>> 
>> 
>> 