Would it not be possible (or even a good idea) to keep row keys completely separate from the DRM, and let DRMs be pure nRow x nCol matrices of numbers? None of the operators (so far) care about the keys. At least none of the existing mapBlock() users do anything with the key, and I'm not sure we can do anything meaningful with the key in a mapBlock. It feels like the two are tightly coupled when they need not be.

I must admit I'm new to this, but it seems keys could be stored in one file and the matrix numbers in another. Mahout (should) only care about and operate on the matrix numbers: it reads from the "number" file and writes output to a new "number" file, and the user can then pair the new number file with the old/original "key file". That would be effectively the same result as loading the keys, moving them around through all the operations, and writing them back. Am I missing something fundamental?
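To make the proposal concrete, here is a minimal sketch in plain Scala. It uses no Mahout APIs; the object, the `scale` operation, and the in-memory "files" are all hypothetical stand-ins. It only shows keys and numbers being kept apart and re-paired by row position afterwards.

```scala
// Hypothetical sketch: plain Scala collections, no Mahout classes involved.
object SeparateKeys {
  // Row keys live on their own (the "key file"); position i labels row i.
  val keys: Vector[String] = Vector("u1", "u2", "u3")

  // The matrix is pure numbers (the "number file"), nRow x nCol.
  val numbers: Vector[Vector[Double]] = Vector(
    Vector(1.0, 2.0),
    Vector(3.0, 4.0),
    Vector(5.0, 6.0)
  )

  // An example operation that never looks at keys: scale every element.
  def scale(m: Vector[Vector[Double]], k: Double): Vector[Vector[Double]] =
    m.map(_.map(_ * k))

  // Re-attach keys afterwards, relying only on row order being unchanged.
  def withKeys(ks: Vector[String],
               m: Vector[Vector[Double]]): Vector[(String, Vector[Double])] = {
    require(ks.length == m.length, "key count must match row count")
    ks.zip(m)
  }

  def main(args: Array[String]): Unit =
    withKeys(keys, scale(numbers, 2.0)).foreach { case (k, row) =>
      println(s"$k -> ${row.mkString(", ")}")
    }
}
```

Note that pairing by position only works while an operation preserves both the number and the order of rows, which is the assumption this sketch bakes in.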
Thanks

On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Looking at the code, i am still not sure without trying.
>
> but i am more inclined to think now that this specific combination, A'B
> with A and B non-int row keys, is not supported.
>
> As a general principle, we followed where our guinea pigs get us, and were
> not trying to fill all possible gaps and holes, with the belief that will
> get us 80/20 caps in shortest time.
>
> As for the rest, we wait for somebody to ask for it because they need it.
>
> But that example is legal and patch should be fundamentally possible and
> easy enough to handle this case within this architecture.
>
> On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > also, if something is not supported, such as your example, (if it is not
> > supported), optimizer would simply state so with rejection. But if it
> > takes it in, then I am pretty sure it will do the right job (or at least
> > there's a unit test for that case that is asserted on a trivial example).
> >
> > Here, by trivial i mean local pipelines for 2-split inputs, that's the
> > general rule i used.
> >
> > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >> a little bit of additional information is that for rewriting rules stage
> >> optimizer does 3 passes over semantic tree, each pass matching a tree
> >> fragment using Scala case class matching and rewriting. This allows to
> >> match and rewrite pretty elaborate tree structure fragments, although at
> >> the moment i don't think we dig farther than immediate children, and
> >> perhaps some their known attributes, in most cases.
> >>
> >> More detailed description that that i think is only in reading the
> >> source.
> >>
> >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >> wrote:
> >>
> >>> E.g. i know for sure A %.% B is legal where A is string-keyed and b is
> >>> int-keyed.
> >>>
> >>> This is kind of not the point. the point is that you can easily modify
> >>> rewriting rules and operators to cover misses. (there shouldn't be
> >>> many, since we've already written quite a bit of expressions out there).
> >>>
> >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >>> wrote:
> >>>
> >>>> I am not sure. There are more rewriting rules than i can remember, and
> >>>> i did not write an algorithm ( i think) that would involve this
> >>>> combination. I guess the best thing is to try in a shell or a unit
> >>>> test. if it falls thru, perhaps a new plan element needs to be added
> >>>> (although I am not very sure there isn't already). I know that there
> >>>> are join-based multiplicative operators there.
> >>>>
> >>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning <ted.dunn...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> > in simple terms, if non-integer row keying is used anywhere, it
> >>>>> > tries to rewrite pipelines so that product orientations never
> >>>>> > require non-int keys to denote columns. In case pipeline makes it
> >>>>> > impossible, optimizer will refuse to produce a plan.
> >>>>> >
> >>>>> > e.g. suppose A is distributed string-keyed.
> >>>>> >
> >>>>> > (A.t %.% A) collect // ok
> >>>>>
> >>>>> What happens with the important case of B.t %.% A where both A and B
> >>>>> are string keyed?
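For readers following the quoted discussion, the idea of rewriting a plan tree by matching fragments with Scala case classes can be illustrated with a toy example. This is only a sketch: the `Plan` hierarchy, the fused `AtB` node, and the rules below are hypothetical and are not Mahout's actual operator classes or rewrite rules.

```scala
// Hypothetical toy plan tree; these are NOT Mahout's real operator classes.
object RewriteSketch {
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Transpose(a: Plan) extends Plan
  case class Times(a: Plan, b: Plan) extends Plan
  // Assumed fused node standing for A'B computed without materializing A'.
  case class AtB(a: Plan, b: Plan) extends Plan

  // One pass: match a fragment at the current node (looking at most one
  // level into its children), rewrite it, then recurse into the operands.
  def rewrite(p: Plan): Plan = p match {
    case Times(Transpose(a), b)  => AtB(rewrite(a), rewrite(b))
    case Times(a, b)             => Times(rewrite(a), rewrite(b))
    case Transpose(Transpose(a)) => rewrite(a) // (A')' is just A
    case Transpose(a)            => Transpose(rewrite(a))
    case other                   => other
  }

  def main(args: Array[String]): Unit = {
    val plan = Times(Transpose(Leaf("A")), Leaf("B"))
    println(rewrite(plan))
  }
}
```

A rule for a combination that is currently missed would, in this toy setting, be one more `case` line in `rewrite`.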