I have a feeling that large joins will be dealt with sooner rather than later (especially with interest and work from people like you). If you look at large queries, things are dominated by large sorts, large joins and large group-by aggregations. We need to make sure those are performant in large clusters before we focus on the prettier things. Hopefully we can leverage Google Compute Engine to ensure this.
On Wed, Mar 13, 2013 at 7:07 AM, David Alves <[email protected]> wrote:

> Hi All
>
> Sorry to revive an old thread…
> I was going through the list looking for the current stance on
> joins and I found Ted's answer.
> What is the main point behind not doing large joins on Drill?
> Is it just simplicity (as in optimizer, etc.) or is there
> something else?
> I mention this because I'm particularly interested in large self
> joins (I can volunteer to work on them myself, of course).
> I'm not against leaving them out of any optimizer goals, if one
> can explicitly select an identity optimizer that will just follow the
> logical plan, but they are a big requirement for me.
> Thoughts?
>
> Best
> David
>
> On Dec 6, 2012, at 7:33 PM, Ted Dunning <[email protected]> wrote:
>
> > Drill is explicitly designed (at this time) with the option of not doing
> > large joins. Triple stores pretty much assume lots of large joins.
> >
> > That said, if you could write some suggested typical queries, it would
> > help the discussion along. If you could go so far as to translate to a
> > logical plan, that would be even cooler.
> >
> > On Fri, Dec 7, 2012 at 2:25 AM, Mike Kogan <[email protected]> wrote:
> >
> >> I would very much be interested in having a SPARQL interface, though I
> >> am not sure how well Drill will handle many joins.
> >>
> >> On Thu, Dec 6, 2012 at 5:13 PM, Ted Dunning <[email protected]> wrote:
> >>
> >>> On Thu, Dec 6, 2012 at 8:44 PM, Julian Hyde <[email protected]> wrote:
> >>>
> >>>> ...
> >>>> 1 A SQL interface (in addition to DrQL interface)
> >>>>
> >>>
> >>> With your help, this may arrive before DrQL is integrated.
> >>>
> >>>> 2 JDBC driver
> >>>>
> >>>
> >>> Should be pretty straightforward. Not on anybody's task list just yet,
> >>> I don't think.
> >>>
> >>>> 3 Access to the stack at a lower level (i.e. a way to use the
> >>>> high-performance scan operators without writing a query)
> >>>>
> >>>
> >>> Definitely going to happen.
> >>>
> >>>> 4 Ability to query in-memory Java data in a compact form (e.g. arrays
> >>>> of primitives or nio buffers)
> >>>>
> >>>
> >>> I wonder if this is just a matter of writing a special scanner or a
> >>> special flavor of join at the execution point. The scanner for the case
> >>> where the in-memory compact form is only readable in sequential form.
> >>> The join-operator if the memory can be accessed at random.
> >>>
> >>> ...
> >>>> I know some of these are outside of Drill's scope. If so, feel free to
> >>>> disregard. But if you don't ask, you don't get. :)
> >>>>
> >>>
> >>> They all look pretty reasonable to me.
