Re: H2O integration - intermediate progress update

Dmitriy Lyubimov Wed, 18 Jun 2014 20:59:16 -0700

On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati <av...@gluster.org> wrote:


> Would it not be possible (or even a good idea) to keep row keys completely
> separate from DRM, and let DRMs be pure nRow x nCol numbers?


That would come only at the cost of breaking compatibility with all the MR
stuff that's been done in Mahout since 2008. Not an option.
But even supposing legacy were not a problem, I see significant benefits in
allowing non-ordinal keys.

For one thing, data almost never comes out of ETL pipelines with
ordinal-enforced keys. Normalizing ordinality would be a pain. There's a
normalization issue for dense data, and there's a uniqueness requirement for
sparse data (in which case it really is no different from any key with only
requirements for hash/equals contracts).

Second, having to map to integral keys creates problems with relating the
results back to their origins and with maintaining those relations.

Given it's already there, and being in the position of an architect, I'd
never give it back.

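To make the point about relating results back to their origins concrete, here
is a minimal sketch in plain Scala. It does not use the Mahout API: the names
KeyedRowsSketch, KeyedRows, mapRows and lookup, and the sample keys, are
hypothetical and exist only for this illustration. It assumes nothing about
the key type beyond a working hashCode/equals.

```scala
// Hypothetical sketch, NOT the Mahout API: rows carry an arbitrary key type K.
// The only thing required of K is a sane hashCode/equals contract.
object KeyedRowsSketch {

  // A tiny stand-in for a row-keyed matrix: each row is (key, values).
  final case class KeyedRows[K](rows: Vector[(K, Vector[Double])]) {

    // Transform the numbers; the keys ride along untouched.
    def mapRows(f: Vector[Double] => Vector[Double]): KeyedRows[K] =
      KeyedRows(rows.map { case (k, v) => k -> f(v) })

    // Relate a result row back to its origin by key (uses hash/equals only).
    def lookup(key: K): Option[Vector[Double]] = rows.toMap.get(key)
  }

  // String keys as they might come out of an ETL pipeline (made-up values).
  val docs: KeyedRows[String] = KeyedRows(Vector(
    "doc-a" -> Vector(1.0, 2.0),
    "doc-b" -> Vector(3.0, 4.0)
  ))

  // After the transformation every row is still addressable by its original key,
  // with no separate key file or ordinal mapping to maintain.
  val doubled: KeyedRows[String] = docs.mapRows(_.map(_ * 2.0))
}
```

With ordinal-only rows the same lookup would need an external key-to-index
mapping that has to be kept in sync across every operation.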



> None of the
> operators (so far) care about the keys.


Simply not true. LSA does, clustering does, and so do about a dozen other
cases in and outside Mahout. That is assuming we are still to support the
algorithms we have not deprecated to date.


> At least none of the existing
> mapBlock() users do anything with the key.


Not true. Not all examples in Mahout, but not true.


> I'm not sure if we can do
> anything meaningful with the key in a mapBlock.


Your not being sure is not a sufficient condition. The sufficient condition
is that everyone is sure of the contrary. It is always hard to argue the
non-existence of a counterexample from a position of probability or
intuition.


> It feels they are tightly
> coupled while they need not have been. I must admit I'm new to this, but it
> feels like - keys could be stored in a separate file, and matrix numbers in
> another. Mahout (should) only care about and operate on Matrix numbers,
> reads from the "number" file, writes output to a new "number" file, and the
> user can use the new number file with the old/original "key file" -
> effectively the same result as loading keys and moving them around through
> all the operations and writing back. Am I missing something fundamental?
>

Everything I said above: legacy, ordinality enforcement, etc.


>
> Thanks
>
>
> On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > Looking at the code, i am still not sure without trying.
> >
> > but i am more inclined to think now that this specific combination, A'B
> > with A and B non-int row keys, is not supported.
> >
> > As a general principle, we followed where our guinea pigs get us, and were
> > not trying to fill all possible gaps and holes, with the belief that will
> > get us 80/20 caps in shortest time.
> >
> > As for the rest, we wait for somebody to ask for it because they need it.
> >
> > But that example is legal and patch should be fundamentally possible and
> > easy enough to handle this case within this architecture.
> >
> >
> >
> >
> > On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > also, if something is not supported, such as your example (if it is not
> > > supported), optimizer would simply state so with rejection. But if it takes
> > > it in, then I am pretty sure it will do the right job (or at least there's
> > > a unit test for that case that is asserted on a trivial example).
> > >
> > > Here, by trivial i mean local pipelines for 2-split inputs, that's the
> > > general rule i used.
> > >
> > >
> > > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > > wrote:
> > >
> > >> a little bit of additional information is that for the rewriting rules
> > >> stage the optimizer does 3 passes over the semantic tree, each pass
> > >> matching a tree fragment using Scala case class matching and rewriting it.
> > >> This allows matching and rewriting pretty elaborate tree structure
> > >> fragments, although at the moment i don't think we dig farther than
> > >> immediate children, and perhaps some of their known attributes, in most
> > >> cases.
> > >>
> > >> A more detailed description than that, i think, is only in reading the
> > >> source.
> > >>
> > >>
> > >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > >> wrote:
> > >>
> > >>> E.g. i know for sure A %.% B is legal where A is string-keyed and B is
> > >>> int-keyed.
> > >>>
> > >>> This is kind of not the point. the point is that you can easily modify
> > >>> rewriting rules and operators to cover misses (there shouldn't be many,
> > >>> since we've already written quite a bit of expressions out there).
> > >>>
> > >>>
> > >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> I am not sure. There are more rewriting rules than i can remember, and
> > >>>> i did not write an algorithm (i think) that would involve this
> > >>>> combination. I guess the best thing is to try in a shell or a unit test.
> > >>>> if it falls thru, perhaps a new plan element needs to be added (although
> > >>>> I am not very sure there isn't already). I know that there are join-based
> > >>>> multiplicative operators there.
> > >>>>
> > >>>>
> > >>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning <ted.dunn...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>> > in simple terms, if non-integer row keying is used anywhere, it tries
> > >>>>> > to rewrite pipelines so that product orientations never require
> > >>>>> > non-int keys to denote columns. In case pipeline makes it impossible,
> > >>>>> > optimizer will refuse to produce a plan.
> > >>>>> >
> > >>>>> > e.g. suppose A is distributed string-keyed.
> > >>>>> >
> > >>>>> > (A.t %.% A) collect  // ok
> > >>>>> >
> > >>>>>
> > >>>>> What happens with the important case of B.t %.% A where both A and B
> > >>>>> are string keyed?
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >
> >
>

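The rewriting stage described in the quoted thread (matching tree fragments
with Scala case classes, and refusing to plan when non-int row keys would have
to denote columns) can be sketched roughly as follows. This is a toy under
stated assumptions, not Mahout's optimizer: Plan, Leaf, Transpose, Times, AtA,
rewrite, intRowKeys and plan are hypothetical names, the sketch makes a single
pass rather than three, and it knows only one rule (fusing A'A). What it
accepts or rejects says nothing about what the real optimizer supports.

```scala
// Hypothetical sketch of a rewrite pass. These case classes are NOT Mahout's
// actual logical operators; they only illustrate the technique.
object RewriteSketch {

  sealed trait Plan
  final case class Leaf(name: String, intKeyed: Boolean) extends Plan
  final case class Transpose(a: Plan) extends Plan
  final case class Times(a: Plan, b: Plan) extends Plan
  final case class AtA(a: Plan) extends Plan // fused A'A, never materializes A'

  // One bottom-up pass: rewrite the children, then match the fragment at hand.
  def rewrite(p: Plan): Plan = p match {
    case Times(a, b) =>
      (rewrite(a), rewrite(b)) match {
        case (Transpose(x), y) if x == y => AtA(x)
        case (ra, rb)                    => Times(ra, rb)
      }
    case Transpose(a) => Transpose(rewrite(a))
    case AtA(a)       => AtA(rewrite(a))
    case leaf: Leaf   => leaf
  }

  // Row keys of a result are Int unless inherited from a non-Int-keyed leaf.
  def intRowKeys(p: Plan): Boolean = p match {
    case Leaf(_, ik)  => ik
    case Times(a, _)  => intRowKeys(a)
    case Transpose(_) => true
    case AtA(_)       => true
  }

  // Refuse plans in which a surviving transpose would have to turn
  // non-Int row keys into column indices.
  def plan(p: Plan): Either[String, Plan] = {
    def legal(q: Plan): Boolean = q match {
      case Transpose(a) => intRowKeys(a) && legal(a)
      case Times(a, b)  => legal(a) && legal(b)
      case AtA(a)       => legal(a)
      case _: Leaf      => true
    }
    val rewritten = rewrite(p)
    if (legal(rewritten)) Right(rewritten)
    else Left("rejected: non-Int row keys would have to denote columns")
  }

  val a: Leaf = Leaf("A", intKeyed = false) // string-keyed
  val b: Leaf = Leaf("B", intKeyed = false) // string-keyed
}
```

In this toy, plan(Times(Transpose(a), a)) is rewritten to the fused AtA(a) and
accepted, while plan(Times(Transpose(b), a)) is rejected because no rule here
covers it. Covering such a miss means adding one more case to rewrite, which is
the kind of extension the thread describes.
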