Thanks Dmitriy,
I see this is a more complicated issue than I'd originally thought. I guess
that comes with the beauty of an engine-agnostic distributed DSL. I'd thought
that the lynchpin was going to be the Dataframe API, which was part of the
reason I'd waited to bring this up until your recent commits and work on
M-1490. Another part was that I really need to learn the Scala/Spark
bindings. I will take some time, hopefully over the weekend, to get more
familiar with the Scala code so as not to turn this thread into "Andy's
Questions on the DSL". That being said, if I could ask a couple of questions
on the dataflow for the DSL I'd very much appreciate it.
Since I'm most familiar with Mahout Naive Bayes, Sebastian's already started
the port of it in M-1493 (so I have an idea of how that's going to look), and
Ted's brought up some traits for classifiers, I'll ask in the context of an
NB classifier.
(1). Is the plan to be able to pull a context-specific DRMLike out of the
Dataframe?
(2). If so, would this be a valid pipeline for an NB classifier? (I've put a
rough Scala sketch after the outline.)
Batch training:
1. Either via the Mahout shell or a Mahout shell script:
   i. Create Dataframes X, L and read input
   ii. Translate Dataframes to context-specific (Spark) DRMLike x, l
   iii. Train NB model on x, l
   iv. Serialize model
or via the CLI:
   i. Create context-specific (Spark) DRMLike x, l and read input
   ii. Train NB model on x, l
   iii. Serialize model
Online classifying:
2. Deploy to a server:
   i. De-serialize NB model
   ii. Classify incoming documents
   iii. Update model (if supported)
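To make sure I'm reading the DSL right, here's the rough Scala shape I have
in mind for the flow above. Every name here (readDataframe, features/labels,
toDrm, NaiveBayes.train, NBModel, incomingDoc) is a placeholder guess on my
part, not committed API; only DrmLike itself is from the current bindings:

  // step 1.i: read input into a Dataframe (readDataframe is hypothetical)
  val training = readDataframe("hdfs://.../training-docs")

  // step 1.ii: translate to context-specific (Spark) DRMLike
  // (features/labels/toDrm stand in for the Dataframe -> DRM translation)
  val x: DrmLike[Int] = training.features.toDrm
  val l: DrmLike[Int] = training.labels.toDrm

  // step 1.iii: train; NaiveBayes.train / NBModel are guesses pending M-1493
  val model: NBModel = NaiveBayes.train(x, l)

  // step 1.iv: serialize the model for the online side
  model.serialize("hdfs://.../nb.model")

  // step 2: on the server, de-serialize (or "mount") and classify
  val served = NBModel.deserialize("hdfs://.../nb.model")
  val scores = served.classify(incomingDoc)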
I will look closer at your remarks regarding serialization as well.
Andy
> Date: Thu, 29 May 2014 17:00:32 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: [email protected]
> To: [email protected]
>
> (1) IMO there's a dependency on engine-independent feature prep. This
> depends on the data frame API (and translation). Realistically, no
> recommender framework will be end-to-end usable without this. This is
> priority #1 in my mind.
>
> (2) I personally view the CLI as significantly lower priority. This comes
> from the belief that both embedded and non-embedded use cases will be
> covered either by using the API or by writing a shell script (we can
> provide shell script templates to run the training flow, though, which I
> tentatively bestowed the extension *.mscala (mahout-scala) upon). We may
> also need to do some additional cosmetic shell work here to make script
> execution and parameterization a bit easier.
>
> In that sense, CLI and Driver work is not terribly interesting to me (but
> that's me).
>
> (3) some stuff inline
>
>
>
>
> On Thu, May 29, 2014 at 4:06 PM, Andrew Palumbo <[email protected]> wrote:
>
> > >
> > > - classify a batch of data
> > >
> > > - serialize a model
> >
>
> Batch applications may be useful for classification stuff. But for
> recommender stuff (like co-occurrence) I have seen exactly 0 real-life use
> cases of such a need so far.
>
> In my experience I never apply recommender-like models in a batch. It is
> always real time, and I end up using some off-heap memory-mapped indices
> to keep random access to model indices instantaneous.
>
> > >
> > > - de-serialize a model
> >
>
> In the case of an indexed serialization format, this rather takes the form
> of "mounting" a model. Off-heap is important since indices need to be fast
> (no networking) and not terrorize the GC, potentially surviving sizes that
> exceed installed physical RAM (e.g. when updating/swapping the model).
> Physical performance of such indices is found to be in the area of
> 10k-20k lookups per millisecond per CPU core. That allows a very high-QPS
> recommendation service without an external system to query (the "node as
> appliance" approach). There will probably eventually come a time when
> recommendation indices become too huge to fit well into available virtual
> memory, but in practice I am still waiting for that to happen. At least
> that's the fastest option I know of for serving recommendations.
>
> That means I always find myself needing a good off-heap index
> implementation (I use custom-coded partitioned immutable bucketized cuckoo
> hashes, b-trees, and walkable PAT tries that can be serialized directly by
> streaming into an OutputFormat; this works for Spark too, of course). That
> calls for some semi-advanced engineering here.
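>
> To give a flavor of the mechanics, a minimal sketch of just the mmap part
> (illustration only; the real structures are the custom ones above, and a
> real index partitions across many buffers to get past the 2 GB buffer
> limit and beyond installed RAM):
>
>   import java.io.RandomAccessFile
>   import java.nio.channels.FileChannel
>
>   // "mount" a fixed-width index off-heap via mmap; lookups then hit the
>   // OS page cache directly, with no heap allocation and no network hop
>   val file = new RandomAccessFile("/models/reco.idx", "r")
>   val idx = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length)
>
>   // constant-time random access into 8-byte slots
>   def lookup(slot: Int): Long = idx.getLong(slot * 8)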
>
> Frankly, I have never found myself doing classification in a batch yet,
> but I can see that it may very well indeed be a good case. Online
> low-latency classification could still be viable, though.
>
> Stuff like topic analysis on a big corpus is always a batch in my
> experience, at least for the initial topic extraction job.
>
> -d