I like the idea of making Parquet/Hive schema'd and returning the schema at planning time. Front-end tools assume that the backend can do a Prepare and then an Execute, and this fits that model much better.
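To make the Prepare/Execute point concrete, here is a minimal sketch in Java. All names (FastSchemaSketch, planSchema, prepare) are illustrative stand-ins, not Drill APIs: the idea is simply that for schema'd sources, Prepare can answer a LIMIT 0 query entirely from planner-known schema, with no execution at all.

```java
// Illustrative sketch, not Drill code: FastSchemaSketch, planSchema, and
// prepare are made-up names for this example.
import java.util.Arrays;
import java.util.List;

public class FastSchemaSketch {

    // Schema known at planning time; a real engine would read it from
    // Parquet footers or the Hive metastore rather than hard-coding it.
    static List<String> planSchema(String table) {
        return Arrays.asList("id:INT", "name:VARCHAR");
    }

    // Prepare resolves the result schema at planning time only; Execute
    // is never involved for a LIMIT 0 query against a schema'd source.
    static List<String> prepare(String sql) {
        return planSchema("t");
    }

    public static void main(String[] args) {
        System.out.println(prepare("SELECT id, name FROM t LIMIT 0"));
        // prints: [id:INT, name:VARCHAR]
    }
}
```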
On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <[email protected]> wrote:

> The only way we get to a few milliseconds is by doing this stuff at
> planning. Let's start by making Parquet schema'd and fixing our implicit
> cast rules. Once completed, we can return schema just through planning and
> completely skip over execution code (as in every other database).
>
> I'd guess that the top issue is for Parquet and Hive. If that is the case,
> let's just start treating them as schema'd all the way through. If people
> are begging for fast schema on JSON, let's take the stuff for Parquet and
> Hive and leverage it via direct sampling at planning time for the
> non-schema'd formats.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]> wrote:
>
> > For (3), are you referring to the operators that extend
> > AbstractSingleRecordBatch? We basically only call the buildSchema()
> > method on blocking operators. If the operators are not blocking, we
> > simply process the first batch, the idea being that it should be fast
> > enough. Are there situations where this is not true? If we are skipping
> > empty batches, that could cause a delay in the schema propagation, but
> > we can handle that case by having special handling for the first batch.
> >
> > As for (4), it's really historical. We originally didn't have fast
> > schema, and when it was added, only the minimal code changes necessary
> > to make it work were done. At the time the fast schema feature was
> > implemented, there was just the "setup" method of the operators, which
> > handled both materializing the output batch and generating the code. It
> > would require additional work, as well as potentially added code
> > complexity, to further separate the parts of setup that are needed for
> > fast schema from those which are not. And I'm not sure how much benefit
> > we would get from it.
> >
> > What is the motivation behind this? In other words, what sort of delays
> > are you currently seeing? And have you done an analysis of what is
> > causing the delay? I would think that code generation would cause only
> > a minimal delay, unless we are concerned about cutting the time for
> > "limit 0" queries down to just a few milliseconds.
> >
> > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]> wrote:
> >
> > > Hey y'all,
> > >
> > > @Jacques and @Steven,
> > >
> > > I am looking at improving the fast schema path (for LIMIT 0 queries).
> > > It seems to me that on the first call to next (the buildSchema call),
> > > in any operator, only two tasks need to be done:
> > > 1) call next exactly once on each of the incoming batches, and
> > > 2) set up the output container based on those incoming batches
> > >
> > > However, looking at the implementation, some record batches:
> > > 3) make multiple calls to incoming batches (with a comment "skip first
> > > batch if count is zero, as it may be an empty schema batch"),
> > > 4) generate code, etc.
> > >
> > > Any reason why (1) and (2) aren't sufficient? Any optimizations that
> > > were considered, but not implemented?
> > >
> > > Thank you,
> > > Sudheesh
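For reference, steps (1) and (2) from Sudheesh's list, plus the empty-schema-batch skip mentioned in (3), can be sketched roughly as below. Batch, firstDataBatch, and buildSchema are illustrative stand-ins for this example, not Drill's actual RecordBatch classes.

```java
// Hedged sketch, not Drill code: a "batch" here is reduced to its schema
// plus a row count, just enough to show the first-next() schema setup.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BuildSchemaSketch {

    static class Batch {
        final List<String> schema;
        final int rowCount;
        Batch(List<String> schema, int rowCount) {
            this.schema = schema;
            this.rowCount = rowCount;
        }
    }

    // Step (1), with the special case from (3): pull the first batch from
    // the incoming, skipping a leading schema-only batch if it is empty
    // ("skip first batch if count is zero").
    static Batch firstDataBatch(Iterator<Batch> incoming) {
        Batch b = incoming.next();
        if (b.rowCount == 0 && incoming.hasNext()) {
            b = incoming.next();
        }
        return b;
    }

    // Step (2): set up an empty output container carrying the incoming
    // schema; no code generation is needed just to report the schema.
    static Batch buildSchema(Iterator<Batch> incoming) {
        Batch first = firstDataBatch(incoming);
        return new Batch(new ArrayList<>(first.schema), 0);
    }

    public static void main(String[] args) {
        Iterator<Batch> in = Arrays.asList(
                new Batch(Arrays.asList("id:INT"), 0),    // empty schema batch
                new Batch(Arrays.asList("id:INT"), 4096)  // first data batch
        ).iterator();
        Batch out = buildSchema(in);
        System.out.println(out.schema + " rows=" + out.rowCount);
        // prints: [id:INT] rows=0
    }
}
```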
