I agree. That makes total sense from a conceptual standpoint. What's needed to do this? Is the framework in place for Drill to do this?
-- Zelaine

On Thu, Nov 5, 2015 at 1:51 PM, Parth Chandra <[email protected]> wrote:

> I like the idea of making Parquet/Hive schema'd and returning the
> schema at planning time. Front-end tools assume that the backend can
> do a Prepare and then Execute, and this fits that model much better.
>
> On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <[email protected]> wrote:
>
> > The only way we get to a few milliseconds is by doing this stuff at
> > planning. Let's start by making Parquet schema'd and fixing our
> > implicit cast rules. Once completed, we can return schema through
> > planning alone and completely skip over execution code (as in every
> > other database).
> >
> > I'd guess that the top issue is for Parquet and Hive. If that is
> > the case, let's just start treating them as schema'd all the way
> > through. If people are begging for fast schema on JSON, let's take
> > the stuff for Parquet and Hive and leverage it via direct sampling
> > at planning time for the non-schema'd formats.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]> wrote:
> >
> > > For (3), are you referring to the operators that extend
> > > AbstractSingleRecordBatch? We basically only call the
> > > buildSchema() method on blocking operators. If the operators are
> > > not blocking, we simply process the first batch, the idea being
> > > that it should be fast enough. Are there situations where this is
> > > not true? If we are skipping empty batches, that could cause a
> > > delay in schema propagation, but we can handle that case by
> > > having special handling for the first batch.
> > >
> > > As for (4), it's really historical. We originally didn't have
> > > fast schema, and when it was added, only the minimal code changes
> > > necessary to make it work were made. At the time the fast schema
> > > feature was implemented, there was just the "setup" method of the
> > > operators, which handled both materializing the output batch and
> > > generating the code. It would require additional work, and
> > > potentially add code complexity, to further separate the parts of
> > > setup that are needed for fast schema from those which are not.
> > > And I'm not sure how much benefit we would get from it.
> > >
> > > What is the motivation behind this? In other words, what sort of
> > > delays are you currently seeing? And have you done an analysis of
> > > what is causing the delay? I would think that code generation
> > > would cause only a minimal delay, unless we are concerned about
> > > cutting the time for "limit 0" queries down to just a few
> > > milliseconds.
> > >
> > > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]> wrote:
> > >
> > > > Hey y'all,
> > > >
> > > > @Jacques and @Steven,
> > > >
> > > > I am looking at improving the fast schema path (for LIMIT 0
> > > > queries). It seems to me that on the first call to next (the
> > > > buildSchema call), in any operator, only two tasks need to be
> > > > done:
> > > > 1) call next exactly once on each of the incoming batches, and
> > > > 2) set up the output container based on those incoming batches
> > > >
> > > > However, looking at the implementation, some record batches:
> > > > 3) make multiple calls to incoming batches (with a comment
> > > > "skip first batch if count is zero, as it may be an empty
> > > > schema batch"), and
> > > > 4) generate code, etc.
> > > >
> > > > Any reason why (1) and (2) aren't sufficient? Any optimizations
> > > > that were considered, but not implemented?
> > > >
> > > > Thank you,
> > > > Sudheesh
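
For reference, a minimal, self-contained Java sketch of the two-step
first-next contract described in Sudheesh's points (1) and (2). This
is a simplified model, not Drill code: Batch, IterOutcome, and
PassThroughOperator below are illustrative stand-ins (Drill's real
RecordBatch and IterOutcome carry far more state, and real operators
transfer value vectors rather than copying a list of column names).

    import java.util.Arrays;
    import java.util.List;

    public class FastSchemaSketch {

      enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

      // Stand-in for a record batch: a schema plus (elided) data.
      interface Batch {
        IterOutcome next();
        List<String> schema(); // simplified: column names only
      }

      // Hypothetical operator doing only steps (1) and (2) on its
      // first next() call.
      static final class PassThroughOperator implements Batch {
        private final Batch incoming;
        private List<String> outputSchema;
        private boolean first = true;

        PassThroughOperator(Batch incoming) {
          this.incoming = incoming;
        }

        @Override
        public IterOutcome next() {
          if (first) {
            first = false;
            incoming.next();                  // (1) exactly one call downstream
            outputSchema = incoming.schema(); // (2) set up the output "container"
            // No code generation here; that can wait for the second
            // next(), which a LIMIT 0 query never issues.
            return IterOutcome.OK_NEW_SCHEMA;
          }
          return IterOutcome.NONE; // real data processing would go here
        }

        @Override
        public List<String> schema() {
          return outputSchema;
        }
      }

      public static void main(String[] args) {
        // A fake scan that reports a two-column schema on first call.
        Batch scan = new Batch() {
          @Override
          public IterOutcome next() {
            return IterOutcome.OK_NEW_SCHEMA;
          }

          @Override
          public List<String> schema() {
            return Arrays.asList("a", "b");
          }
        };

        Batch op = new PassThroughOperator(scan);
        System.out.println(op.next());   // OK_NEW_SCHEMA
        System.out.println(op.schema()); // [a, b]
      }
    }

Under this contract, a LIMIT 0 query drives the pipeline through only
the first call to next(), which is why deferring code generation past
that point (Sudheesh's point 4) is the optimization being weighed.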
