The only way we get to a few milliseconds is by doing this stuff at
planning. Let's start by making Parquet schema'd and fixing our implicit
cast rules. Once that's done, we can return the schema through planning alone
and skip execution code entirely (as every other database does).

I'd guess that the top issue is for Parquet and Hive. If that is the case,
let's just start treating them as schema'd all the way through. If people
are begging for fast schema on JSON, let's take the work done for Parquet and
Hive and leverage it via direct sampling at planning time for the non-schema'd
formats.
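
For what it's worth, the split could look roughly like this. This is a minimal
sketch, not Drill's actual APIs; every name below is made up for illustration.
Schema'd formats (Parquet, Hive) hand back their declared schema directly at
planning, while non-schema'd formats (JSON) fall back to inferring one from a
small planning-time sample:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of planning-time schema resolution (illustrative
 * names only, not Drill's real interfaces). A LIMIT 0 query's schema is
 * resolved entirely during planning, never touching execution.
 */
public class PlanTimeSchema {

    /**
     * @param declared the format's declared schema (column name -> type),
     *                 or null for non-schema'd formats like JSON
     * @param sample   a small planning-time sample of records, used only
     *                 when no declared schema exists
     */
    static Map<String, String> resolve(Map<String, String> declared,
                                       List<Map<String, Object>> sample) {
        if (declared != null) {
            return declared;                      // Parquet/Hive: known up front
        }
        // JSON path: infer each column's type from the first value seen.
        Map<String, String> inferred = new LinkedHashMap<>();
        for (Map<String, Object> rec : sample) {
            for (Map.Entry<String, Object> e : rec.entrySet()) {
                inferred.putIfAbsent(e.getKey(),
                    e.getValue() instanceof Number ? "BIGINT" : "VARCHAR");
            }
        }
        return inferred;
    }
}
```

The point of the sketch is just that both paths produce a schema before any
execution code runs; only the non-schema'd path pays the sampling cost.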

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]> wrote:

> For (3), are you referring to the operators with extend
> AbstractSingleRecordBatch? We basically only call the buildSchema() method
> on blocking operators. If the operators are not blocking, we simply process
> the first batch, the idea being that it should be fast enough. Are there
> situations where this is not true? If we are skipping empty batches, that
> could cause a delay in the schema propagation, but we can handle that case
> by having special handling for the first batch.
>
> As for (4), it's really historical. We originally didn't have fast schema,
> and when it was added, only the minimal code changes necessary to make it
> work were done. At the time the fast schema feature was implemented, there
> was just the "setup" method of the operators, which handled both
> materializing the output batch as well as generating the code. It would
> require additional work as well as potentially adding code complexity to
> further separate the parts of setup that are needed for fast schema from
> those which are not. And I'm not sure how much benefit we would get from
> it.
>
> What is the motivation behind this? In other words, what sort of delays are
> you currently seeing? And have you done an analysis of what is causing the
> delay? I would think that code generation would cause only a minimal delay,
> unless we are concerned about cutting the time for "limit 0" queries down
> to just a few milliseconds.
>
> On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]>
> wrote:
>
> > Hey y’all,
> >
> > @Jacques and @Steven,
> >
> > I am looking at improving the fast schema path (for LIMIT 0 queries). It
> > seems to me that on the first call to next (the buildSchema call), in any
> > operator, only two tasks need to be done:
> > 1) call next exactly once on each of the incoming batches, and
> > 2) setup the output container based on those incoming batches
> >
> > However, looking at the implementation, some record batches:
> > 3) make multiple calls to incoming batches (with a comment “skip first
> > batch if count is zero, as it may be an empty schema batch”),
> > 4) generate code, etc.
> >
> > Any reason why (1) and (2) aren’t sufficient? Any optimizations that were
> > considered, but not implemented?
> >
> > Thank you,
> > Sudheesh
>
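
Sudheesh's steps (1) and (2) above could be sketched roughly as follows. This
is purely illustrative; the interface and enum names are made up and are not
Drill's real RecordBatch API:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the first-batch ("fast schema") call: pull exactly
 * one batch from each incoming, then set up the output container from the
 * schemas observed. Illustrative names only, not Drill's real interfaces.
 */
public class FastSchemaSketch {

    enum Outcome { OK_NEW_SCHEMA, OK, NONE }

    interface Incoming {
        Outcome next();
        List<String> schema();  // valid after the first next()
    }

    /**
     * Step (1): call next exactly once on each incoming batch.
     * Step (2): build the output container from those incoming schemas.
     */
    static List<String> buildSchema(List<Incoming> incomings) {
        List<String> output = new ArrayList<>();
        for (Incoming in : incomings) {
            Outcome o = in.next();             // exactly one call per incoming
            if (o != Outcome.NONE) {
                output.addAll(in.schema());    // materialize output container
            }
        }
        return output;
    }
}
```

Anything beyond these two steps on the first call, such as extra next() calls
to skip empty batches or full code generation, is where the schema-propagation
delay discussed above can creep in.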