(1) IMO there's a dependency on engine-independent feature prep. This
depends on the data frame API (and translation). Realistically, no
recommender framework will be end-to-end usable without it. This is
priority #1 in my mind.

(2) I personally view the CLI as a significantly lower priority. This comes
from the belief that both embedded and non-embedded use cases will be
covered either by using the API or by writing a shell script (we can
provide shell script templates to run the training flow, though, on which I
have tentatively bestowed the extension *.mscala, i.e. mahout-scala). We
may also need to do some additional cosmetic shell work here to make script
execution and parameterization a bit easier.

In that sense, CLI and Driver work is not terribly interesting to me (but
that's me).
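
Just to make the *.mscala template idea a bit more concrete, here is a rough
sketch of what such a script could contain. It assumes the script is fed to a
Mahout Spark shell session that has already imported the DRM DSL and set up
the implicit distributed context; the paths, the plain A'A co-occurrence step
and the exact DSL entry points (drmDfsRead, dfsWrite) are
placeholders/assumptions, not a proposed convention:

  // train-cooccurrence.mscala -- hypothetical training-flow template
  val drmA = drmDfsRead("/path/to/interactions")   // user x item interactions as a DRM
  val drmAtA = drmA.t %*% drmA                     // plain co-occurrence A'A
  drmAtA.dfsWrite("/path/to/cooccurrence-model")   // persist the model for serving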

(3) Some comments inline.

On Thu, May 29, 2014 at 4:06 PM, Andrew Palumbo <[email protected]> wrote:

> >
> >    - classify a batch of data
> >
> >    - serialize a model
>

Batch applications may be useful for classification stuff. But for
recommender stuff (like co-occurrence) I have seen exactly zero real-life
use cases for it so far.

In my experience I never apply recommender-like models in batch. It is
always real time, and I end up using off-heap memory-mapped indices to keep
random access to the model instantaneous.

> >
> >    - de-serialize a model
>

In the case of an indexed serialization format, this rather takes the form
of "mounting" a model. Off-heap is important since the indices need to be
both fast (no networking) and not terrorize the GC, potentially surviving
sizes that exceed installed physical RAM (e.g. when updating/swapping the
model). Physical performance of such indices is found to be in the area of
10k-20k lookups per millisecond per CPU core. That allows a very high-QPS
recommendation service without an external system to query (a "node as
appliance" approach). There will probably eventually come a time when
recommendation indices become too huge to fit well into available virtual
memory, but in practice I am still waiting for that to happen. At least,
that's the fastest option I know of to serve recommendations.

That means I always find myself needing a good off-heap index
implementation (I use custom-coded partitioned immutable bucketized cuckoo
hashes, B-trees and walkable PAT tries that can be serialized directly by
streaming into an OutputFormat; this works for Spark too, of course). That
calls for some semi-advanced engineering here.
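
Just to illustrate the "mounting" idea, here is a minimal sketch of off-heap
random access over a memory-mapped model file. It assumes a made-up
fixed-width record layout (sorted 8-byte keys with 8-byte values), not the
cuckoo-hash/PAT-trie formats above; the point is only that lookups go
straight to the mapped region and the index data never lands on the JVM heap:

  import java.io.RandomAccessFile
  import java.nio.channels.FileChannel

  // Hypothetical layout: fixed-width records of (key: Long, value: Double),
  // sorted by key. Real index formats would look quite different.
  class MountedModel(path: String) {
    private val channel = new RandomAccessFile(path, "r").getChannel
    // The mapped buffer lives outside the JVM heap; the OS pages it in on demand.
    private val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
    private val recSize = 16
    private val n = (channel.size() / recSize).toInt

    // Binary search over the sorted keys; the index itself never touches the heap.
    def lookup(key: Long): Option[Double] = {
      var lo = 0
      var hi = n - 1
      while (lo <= hi) {
        val mid = (lo + hi) >>> 1
        val k = buf.getLong(mid * recSize)
        if (k == key) return Some(buf.getDouble(mid * recSize + 8))
        else if (k < key) lo = mid + 1
        else hi = mid - 1
      }
      None
    }
  }

In practice such a buffer would be mapped once when the model is swapped in
and then shared read-only across request threads.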

Frankly, I have not found myself doing classification in batch yet, but I
can see that it may very well be a good case. But online low-latency
classification could still be viable.

Stuff like topic analysis on a big corpus is always a batch job in my
experience, at least for the initial topic extraction.

-d
