On Fri, May 30, 2014 at 6:14 AM, Pat Ferrel <[email protected]> wrote:

> A dataframe isn’t required for batch training, if all you need is drm
> functionality. What ItemSimilarity does is wrap a drm with two
> dictionaries in an IndexedDataset. Since the wrapping is almost trivial it
> can be replaced with dataframes later, or not. But the wrapping allows us
> to read and write data in user specified form. Right now that means text
> delimited files. It then outputs the similarity matrix in TDF. One thing
> that concerns me about the IndexedDataset, which may be solved by
> dataframes, is that the dictionaries are completely in memory on each
> cluster node. It might be better to have an RDD backed form for times when
> the dictionaries are too large.
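
For concreteness, a minimal sketch of the wrapping idea (names and types here
are illustrative, not the committed IndexedDataset signature):

  import org.apache.mahout.math.drm.DrmLike

  // Sketch only: a drm plus two in-memory dictionaries mapping external
  // string IDs to the matrix's integer row/column indices.
  case class IndexedDatasetSketch(
    matrix: DrmLike[Int],                   // the wrapped drm
    rowIDs: collection.Map[String, Int],    // e.g. user ID -> row index
    columnIDs: collection.Map[String, Int]) // e.g. item ID -> column index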
>
> Dmitriy are ID mappings kept in an RDD backed thing with dataframes?
>
Our data frames are data frames in the R sense. I.e. they will have column
names and (maybe) row keys (row keys can also be represented simply by a
data frame column).
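
A purely hypothetical illustration of that idea (not a committed API):

  // An R-style data frame is just a set of named columns; row keys, if
  // present, can be carried as one more column ("rowKey" is a made-up name).
  case class DataFrameSketch(columns: Map[String, Array[Any]]) {
    def rowKeys: Option[Array[Any]] = columns.get("rowKey")
  }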

>
> Something that concerns me about dataframes is whether they will be useful
> for batch operations given D’s avowed lack of interest :-)
>

Pat, please don't dump everything in one  pile :)

_any_ ML framework has a process of several stages; here's my breakdown
(roughly sketched as stage signatures after the list):

- featurization (log collection, consolidation, joining, attribution into a
coherent collection)
- vectorization/standardization  (simply, turn features into numbers,
vectors or matrices)
- training
   (here we insert serialization -> delivery -> deserialization)
- prediction
- logging
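
As signatures, the stages look roughly like this (all types and names below
are placeholders; the point is only where each stage begins and ends):

  type RawLog   = String
  type Features = Map[String, String]

  def featurize(logs: Seq[RawLog]): Seq[Features] = ???             // collect, consolidate, join, attribute
  def vectorize(features: Seq[Features]): Seq[Array[Double]] = ???  // features -> numbers/vectors/matrices
  def train(vectors: Seq[Array[Double]]): Array[Byte] = ???         // model bytes: serialize -> deliver -> deserialize
  def predict(model: Array[Byte], input: Array[Double]): Double = ???
  def log(prediction: Double): Unit = ???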

What i was commenting about was specifically the prediction and logging stages.
One way of doing this is to compute top predictions for each user. Another
way is to compute predictions in real time (usually because
some of the predictors and business rules are only known at the time of
recommendation impression, e.g. time of day or geo -- so you have no other
way but to feed them into predict() on the fly.)
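
For example (hypothetical names), the impression-time context ends up as an
argument to predict() rather than something precomputed in a batch:

  // Time of day and geo are only known at request time.
  case class ImpressionContext(hourOfDay: Int, geo: String)

  def predict(offlineScores: Map[String, Double],   // precomputed per-item scores for this user
              ctx: ImpressionContext): Seq[(String, Double)] =
    offlineScores.toSeq
      .map { case (item, s) => (item, if (ctx.hourOfDay < 6) s * 0.5 else s) } // stand-in business rule
      .sortBy(-_._2)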

Every other stage here (up to training) is usually either batch or
streaming. Data frames are to be used primarily in featurization and
vectorization, which is either streaming (in the Spark/Storm sense) or a
batch. These stages can benefit from fast columnar organization of data
frames allowing fast multiple passes. I can imagine some methodologies in
training _may_ work better off data frames too, rather than off the
matrices.

hope that clarifies.



> For I/O I’ll put together a proposal this weekend (based on running IS
> code) that has an abstract Store class.  At instantiation time you compose
> one or two traits as mixins for reading and writing. This allows a good
> degree of extensibility. It should also work fine for getting a drm into or
> out of the shell though I haven’t tried it. The only read/write traits
> implemented are TDF.
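
Something along these lines, a sketch of the mixin idea only (trait and method
names are illustrative, not the actual proposal):

  // An abstract Store composed with reader/writer traits at instantiation time.
  abstract class Store(val path: String)

  trait TDReader { self: Store =>
    def read(): Iterator[String] = io.Source.fromFile(path).getLines()  // text-delimited read
  }

  trait TDWriter { self: Store =>
    def write(lines: Seq[String]): Unit = {
      val out = new java.io.PrintWriter(path)                           // text-delimited write
      try lines.foreach(out.println) finally out.close()
    }
  }

  // compose one or both traits when instantiating
  val store = new Store("/tmp/ratings.tsv") with TDReader with TDWriter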
>
> There is a proposed MahoutOptionParser, which is a trivial mod of Scopt, a
> Scala options parser.
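
For reference, roughly how a Scopt 3.x-style parser reads (option names here
are made up, not the proposed MahoutOptionParser options):

  case class Options(input: String = "", output: String = "")

  object ItemSimilarityCli extends App {
    val parser = new scopt.OptionParser[Options]("itemsimilarity") {
      opt[String]('i', "input") required() action { (x, o) => o.copy(input = x) } text("input path")
      opt[String]('o', "output") required() action { (x, o) => o.copy(output = x) } text("output path")
    }
    parser.parse(args, Options()) foreach { opts =>
      // hand the parsed options to the driver
    }
  }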
>
> IMO we should wait on core DSL functionality if it’s not there but if you
> are doing something that is external then full blown dataframes may not
> block you or even help you. Drms are pretty mature. You’ll have to decide
> that based on your own needs.
>
> As to recommenders I see no reason to wait for dataframes. In fact with IS
> running, there is nothing related to the DSL left to do. Put an indicator
> matrix with external IDs into Solr and you have an interactive recommender.
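
A rough sketch of that query pattern (field and item IDs are made up): the
user's recent history becomes an OR query against the indicator field, and
Solr's ranking does the rest.

  val userHistory = Seq("item42", "item7", "item19")            // items the user interacted with
  val q = "indicators:(" + userHistory.mkString(" OR ") + ")"   // query the indexed indicator field
  // issuing q against the collection returns items ranked by similarity to that history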
>
> On May 29, 2014, at 9:32 PM, Andrew Palumbo <[email protected]> wrote:
>
> Thanks Dmitriy,
> I see this is a more complicated issue than I'd originally thought. I
> guess that comes with the beauty of an engine-agnostic distributed DSL.
> I'd thought that the lynchpin was going to be the Dataframe api, which was
> part of the reason that I'd waited to bring this up until your recent
> commits and work on M-1490. Another part being that I really need to learn
> the scala/spark bindings. I will take some time hopefully over the weekend
> to get more familiar with the scala code so as not to turn this thread into
> "Andy's Questions on the DSL". That being said, if I could ask a couple of
> questions about the dataflow for the DSL I'd very much appreciate it.
>
> Since I'm most familiar with Mahout Naive Bayes, Sebastian's already
> started the port of it in M-1493 (so I have an idea of how that's going to
> look), and Ted's brought up some traits for classifiers, I'll ask in the
> context of an NB classifier.
>
> (1). Is the plan to be able to pull a context specific DRMLike out of the
> Dataframe?
>
> (2). If so, would this be a valid pipeline for an NB classifier?
>
> Batch Training:
> 1.  Either via Mahout Shell or Mahout Shell script:
>        i.   Create Dataframes X,L and read input
>        ii.  Translate Dataframes to context specific (Spark) DRMLike x,l
>        iii. Train NB model on x,l
>        iv.  Serialize model
>    or via CLI
>        i.   Create context specific (Spark) DRMLike x,l and read input
>        ii.  Train NB model on x,l
>        iii. Serialize model.
>
> Online classifying:
> 2.  Deploy to a server:
>        i.   De-serialize NB model
>           a.  Classify incoming documents
>           b.  Update model (if supported)
>
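
To make the batch-training path concrete, a rough sketch in the Spark bindings
DSL (method names are guesses since the MAHOUT-1493 port is in progress, and a
distributed context is assumed to be in scope as in the shell):

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // i.  read input into context-specific DRMs: term counts x, one-hot labels l
  val x = drmDfsRead("hdfs://.../termCounts")   // docs x terms
  val l = drmDfsRead("hdfs://.../labels")       // docs x labels

  // ii. train: per-label aggregation of term counts is a product in the DSL
  val aggregated = l.t %*% x                    // labels x terms

  // iii. serialize the model (here simply the aggregated matrix)
  aggregated.dfsWrite("hdfs://.../nbModel")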
>
> I will look closer at your remarks regarding serialization as well.
>
> Andy
>
> > Date: Thu, 29 May 2014 17:00:32 -0700
> > Subject: Re: Sketching out scala traits and 1.0 API
> > From: [email protected]
> > To: [email protected]
> >
> > (1) IMO there's a dependency on engine-independent feature prep. This
> > depends on data frame api (and translation). Realistically any recommender
> > framework will not be end-to-end usable without this. This is priority #1
> > in my mind.
> >
> > (2) I personally view CLI as significantly lower priority. This comes from
> > the belief that both embedded and non-embedded use cases will be covered by
> > either using the api, or writing a shell script (we can provide shell script
> > templates to run training flows though, which i tentatively bestowed the
> > extension *.mscala (mahout-scala) upon). We may also need to do some
> > additional cosmetic shell work here to make script execution and
> > parameterization a bit easier.
> >
> > In that sense, CLI and Driver work is not terribly interesting to me (but
> > that's me).
> >
> > (3) some stuff inline
> >
> >
> >
> >
> > On Thu, May 29, 2014 at 4:06 PM, Andrew Palumbo <[email protected]> wrote:
> >
> >>>
> >>>   - classify a batch of data
> >>>
> >>>   - serialize a model
> >>
> >
> > Batch applications may be useful for classification stuff. But for
> > recommender stuff (like co-occurrence) I have seen exactly 0 real-life use
> > cases of such need so far.
> >
> > in my experience i never apply recommender-like models on a batch. It is
> > always real time, and I end up using some off-heap memory-mapped
> > indices to keep random access to model indices instantaneous.
> >
> >>>
> >>>   - de-serialize a model
> >>
> >
> > In the case of an indexed serialization format, this rather takes the form
> > of "mounting" a model. Off-heap is important since indices need to be both
> > fast (no networking) and not terrorize GC, potentially surviving sizes
> > that exceed installed physical RAM (e.g. when updating/swapping the
> > model). Physical performance of such indices is found to be in the area of
> > 10k-20k lookups per millisecond per cpu core. That allows a very high
> > QPS recommendation service without an external system to query (the "node
> > as appliance" approach). There will probably eventually come a time when
> > recommendation indices become too huge to fit well into available virtual
> > memory, but in practice i am still waiting for that to happen. At least
> > that's the fastest option to serve multiple recommendations i know of.
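
A bare-bones illustration of the "mounting" idea via java.nio memory mapping
(the real indices described above are far more involved; the record layout
here is invented):

  import java.io.RandomAccessFile
  import java.nio.channels.FileChannel

  // "Mount" a serialized model by memory-mapping it: the OS pages it in on
  // demand, lookups stay off-heap, and the file can exceed the JVM heap.
  val file = new RandomAccessFile("/models/recs.idx", "r")
  val buf  = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length)

  // a lookup is just a read at a computed offset, e.g. fixed-width doubles:
  def scoreAt(recordIndex: Int): Double = buf.getDouble(recordIndex * 8)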
> >
> > That means that I always find myself needing a good off-heap index
> > implementation (I use custom-coded partitioned immutable bucketized cuckoo
> > hashes, b-trees and walkable PAT tries that can be serialized directly by
> > streaming into OutputFormat; works for spark too of course). That calls for
> > some semi-advanced engineering here.
> >
> > Frankly, i have never found myself doing classification in a batch yet, but
> > i can see that it may very well be a good case. But online low-latency
> > classification could still be viable.
> >
> > Stuff like topic analysis on a big corpus is always a batch in my
> > experience, at least for the initial topic extraction job.
> >
> > -d
>
>
>
