I think we have to resolve these issues. Otherwise, we'll always behave inconsistently (and confuse the user).
I'm sure we won't find 100% immediately, but we can probably get to 99.9%
reasonably quickly and then manage the last 0.1% as we learn about those
cases. My guess is that the aggregation outputs (which Calcite just merged an
accommodation for) and tighter implicit casting would get us close to that
99.9% (maybe with one or two other things that Aman and/or Jinfeng could
mention).

So the steps, to me:
- Be more specific about function outputs at planning time (including Drill's
  UDFs), especially aggregates with different output types
- Make Drill behave like the SQL standard/Calcite for implicit casting rules
- Leverage this for Hive in planning
- Add support for Parquet merging and schema'd behavior
- Add support for partially dynamic schema'd tables (e.g. I know these 10
  columns exist, but someone can name another as well)
- Add support for JSON etc. sampling behavior for less well-defined types

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Nov 5, 2015 at 2:10 PM, Jinfeng Ni <[email protected]> wrote:

> DRILL-3623 was originally about getting the schema at planning time for
> Hive tables. Once Parquet becomes schema'd, it could be applied to Parquet
> tables as well.
>
> However, there are issues in terms of type resolution. The following are
> the comments I put in the PR for DRILL-3623.
>
> "
>
> The original approach (skipping the execution phase for limit 0
> completely) could actually have issues in some cases, due to differences
> between Calcite's rules and Drill's execution rules in terms of how types
> are determined.
>
> For example, sum(int) in Calcite is resolved to int, while in Drill
> execution we changed it to bigint. Another case is implicit casts.
> Currently, there are some small differences between Calcite and Drill
> execution. That means that if we skip execution for limit 0, the types
> resolved in Calcite could be different from the types produced if the
> query goes through Drill execution. For a BI tool like Tableau, that means
> the type returned from the "limit 0" query and the type from a second
> query w/o "limit 0" could be different.
>
> If we want to avoid the above issues, we have to detect all those cases,
> which is painful. That's why Sudheesh and I are now more inclined toward
> this new approach.
>
> "
>
> Until we resolve the differences in type resolution between planning and
> execution, we cannot directly return the schema at planning time. One
> piece of good news is that Calcite recently added fixes that allow
> specifying how aggregation types are returned, which should fix the
> first issue.
>
>
> On Thu, Nov 5, 2015 at 1:54 PM, Zelaine Fong <[email protected]> wrote:
> > I agree. That makes total sense from a conceptual standpoint. What's
> > needed to do this? Is the framework in place for Drill to do this?
> >
> > -- Zelaine
> >
> > On Thu, Nov 5, 2015 at 1:51 PM, Parth Chandra <[email protected]> wrote:
> >
> >> I like the idea of making Parquet/Hive schema'd and returning the
> >> schema at planning time. Front-end tools assume that the backend can do
> >> a Prepare and then Execute, and this fits that model much better.
> >>
> >>
> >>
> >> On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <[email protected]> wrote:
> >>
> >> > The only way we get to a few milliseconds is by doing this stuff at
> >> > planning. Let's start by making Parquet schema'd and fixing our
> >> > implicit cast rules. Once completed, we can return the schema just
> >> > through planning and completely skip over execution code (as in every
> >> > other database).
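
(To make the first step above concrete: a minimal sketch of the
aggregation-output fix, assuming the Calcite accommodation in question is the
RelDataTypeSystem.deriveSumType hook. The class below is illustrative only,
not Drill's actual type system; it widens integer SUM results to bigint so
the type Calcite resolves at planning time matches what Jinfeng describes
Drill's execution producing.)

// Illustrative sketch only: class name and wiring are assumptions, and it
// assumes the recent Calcite change is the RelDataTypeSystem.deriveSumType hook.
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.rel.type.RelDataTypeSystemImpl;
import org.apache.calcite.sql.type.SqlTypeName;

public class WidenedSumTypeSystem extends RelDataTypeSystemImpl {

  @Override
  public RelDataType deriveSumType(RelDataTypeFactory typeFactory,
                                   RelDataType argumentType) {
    // Calcite's default keeps SUM(INT) as INT; widen integer inputs to BIGINT
    // so the type resolved during planning matches Drill's execution output.
    if (SqlTypeName.INT_TYPES.contains(argumentType.getSqlTypeName())) {
      return typeFactory.createTypeWithNullability(
          typeFactory.createSqlType(SqlTypeName.BIGINT),
          argumentType.isNullable());
    }
    return super.deriveSumType(typeFactory, argumentType);
  }
}

With something along these lines plugged into validation, sum(int) would
already resolve to bigint at planning time, so a "limit 0" answered from the
plan alone would agree with a full execution.
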
> >> > I'd guess that the top issue is for Parquet and Hive. If that is the
> >> > case, let's just start treating them as schema'd all the way through.
> >> > If people are begging for fast schema on JSON, let's take the stuff
> >> > for Parquet and Hive and leverage it via direct sampling at planning
> >> > time for the non-schema'd formats.
> >> >
> >> > --
> >> > Jacques Nadeau
> >> > CTO and Co-Founder, Dremio
> >> >
> >> > On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]>
> >> > wrote:
> >> >
> >> > > For (3), are you referring to the operators that extend
> >> > > AbstractSingleRecordBatch? We basically only call the buildSchema()
> >> > > method on blocking operators. If the operators are not blocking, we
> >> > > simply process the first batch, the idea being that it should be
> >> > > fast enough. Are there situations where this is not true? If we are
> >> > > skipping empty batches, that could cause a delay in schema
> >> > > propagation, but we can handle that case with special handling for
> >> > > the first batch.
> >> > >
> >> > > As for (4), it's really historical. We originally didn't have fast
> >> > > schema, and when it was added, only the minimal code changes
> >> > > necessary to make it work were done. At the time the fast schema
> >> > > feature was implemented, there was just the "setup" method of the
> >> > > operators, which handled both materializing the output batch and
> >> > > generating the code. It would require additional work, and
> >> > > potentially add code complexity, to further separate the parts of
> >> > > setup that are needed for fast schema from those that are not. And
> >> > > I'm not sure how much benefit we would get from it.
> >> > >
> >> > > What is the motivation behind this? In other words, what sort of
> >> > > delays are you currently seeing? And have you done an analysis of
> >> > > what is causing the delay? I would think that code generation would
> >> > > cause only a minimal delay, unless we are concerned about cutting
> >> > > the time for "limit 0" queries down to just a few milliseconds.
> >> > >
> >> > > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hey y’all,
> >> > > >
> >> > > > @Jacques and @Steven,
> >> > > >
> >> > > > I am looking at improving the fast schema path (for LIMIT 0
> >> > > > queries). It seems to me that on the first call to next (the
> >> > > > buildSchema call), in any operator, only two tasks need to be
> >> > > > done:
> >> > > > 1) call next exactly once on each of the incoming batches, and
> >> > > > 2) set up the output container based on those incoming batches
> >> > > >
> >> > > > However, looking at the implementation, some record batches:
> >> > > > 3) make multiple calls to incoming batches (with a comment “skip
> >> > > > first batch if count is zero, as it may be an empty schema
> >> > > > batch”), and
> >> > > > 4) generate code, etc.
> >> > > >
> >> > > > Any reason why (1) and (2) aren’t sufficient? Any optimizations
> >> > > > that were considered but not implemented?
> >> > > >
> >> > > > Thank you,
> >> > > > Sudheesh
> >> > >
> >> >
> >>
>
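
And to make Sudheesh's two-step schema pass concrete, here is a rough sketch
of (1) and (2). The types below are hypothetical stand-ins for illustration,
not Drill's actual RecordBatch or VectorContainer classes:

// Hypothetical stand-ins for illustration; not Drill's actual operator APIs.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

interface Upstream {
  /** Advances to the next batch; the first call materializes the schema. */
  void next();

  /** Column name -> type name of the current batch. */
  Map<String, String> schema();
}

final class SchemaOnlyPass {

  /**
   * The two steps from the mail above: (1) call next() exactly once on each
   * incoming so its schema is known, then (2) assemble the outgoing schema
   * from those incomings. No code generation and no data processing here.
   */
  static Map<String, String> buildOutputSchema(List<Upstream> incomings) {
    Map<String, String> outgoing = new LinkedHashMap<>();
    for (Upstream incoming : incomings) {
      incoming.next();                    // step (1): exactly one call
      outgoing.putAll(incoming.schema()); // step (2): carry the columns through
    }
    return outgoing;
  }
}

Everything else (extra calls on the incoming, code generation) would be
deferred until real data flows.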
