I think we have to resolve these issues. Otherwise, we'll always behave inconsistently (and confuse the user).
I'm sure we won't find 100% immediately, but we can probably get to 99.9%
reasonably quickly and then manage the last 0.1% as we learn about those
cases. My guess is that the aggregation outputs (which Calcite just merged an
accommodation for) and tighter implicit casting would get us close to that
99.9% (maybe with one or two other things that Aman and/or Jinfeng could
mention).

So the steps, to me:
- Be more specific about function outputs at planning time (including Drill's
  UDFs), especially aggregates with different output types
- Make Drill behave like the SQL standard/Calcite for implicit casting rules
- Leverage this for Hive in planning
- Add support for Parquet merging and schema'd behavior
- Add support for partially dynamic schema'd tables (e.g. I know these 10
  columns exist, but someone can name another as well)
- Add support for JSON etc. sampling behavior for less well-defined types

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Nov 5, 2015 at 2:10 PM, Jinfeng Ni <[email protected]> wrote:

> DRILL-3623 was originally about getting the schema at planning time for
> Hive tables. Once Parquet becomes schema'd, it could be applied to Parquet
> tables as well.
>
> However, there are issues in terms of type resolution. The following are
> the comments I put in the PR for DRILL-3623.
>
> "
>
> The original approach (skipping the execution phase for limit 0
> completely) could actually have issues in some cases, due to differences
> between Calcite's rules and Drill's execution rules in terms of how types
> are determined.
>
> For example, sum(int) in Calcite is resolved to int, while in Drill
> execution we changed it to bigint. Another case is implicit casts.
> Currently, there are some small differences between Calcite and Drill
> execution. That means that if we skip execution for limit 0, the types
> resolved in Calcite could be different from the types produced if the
> query goes through Drill execution. For a BI tool like Tableau, that means
> the type returned from the "limit 0" query and the type from a second
> query w/o "limit 0" could be different.
>
> If we want to avoid the above issues, we have to detect all those cases,
> which is painful. That's why Sudheesh and I are now more inclined toward
> this new approach.
>
> "
>
> Until we resolve the differences in type resolution between planning and
> execution, we cannot directly return the schema at planning time. One
> piece of good news is that Calcite recently added fixes that allow
> specifying how aggregation types are returned, which should fix the
> first issue.
>
>
> On Thu, Nov 5, 2015 at 1:54 PM, Zelaine Fong <[email protected]> wrote:
> > I agree. That makes total sense from a conceptual standpoint. What's
> > needed to do this? Is the framework in place for Drill to do this?
> >
> > -- Zelaine
> >
> > On Thu, Nov 5, 2015 at 1:51 PM, Parth Chandra <[email protected]> wrote:
> >
> >> I like the idea of making Parquet/Hive schema'd and returning the
> >> schema at planning time. Front-end tools assume that the backend can do
> >> a Prepare and then Execute, and this fits that model much better.
> >>
> >>
> >>
> >> On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <[email protected]> wrote:
> >>
> >> > The only way we get to a few milliseconds is by doing this stuff at
> >> > planning. Let's start by making Parquet schema'd and fixing our
> >> > implicit cast rules. Once completed, we can return the schema just
> >> > through planning and completely skip over execution code (as in every
> >> > other database).
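
(To make the first step above concrete: a minimal sketch of the
aggregation-output fix, assuming the Calcite accommodation in question is the
RelDataTypeSystem.deriveSumType hook. The class below is illustrative only,
not Drill's actual type system; it widens integer SUM results to bigint so
the type Calcite resolves at planning time matches what Jinfeng describes
Drill's execution producing.)

// Illustrative sketch only: class name and wiring are assumptions, and it
// assumes the recent Calcite change is the RelDataTypeSystem.deriveSumType hook.
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.rel.type.RelDataTypeSystemImpl;
import org.apache.calcite.sql.type.SqlTypeName;

public class WidenedSumTypeSystem extends RelDataTypeSystemImpl {

  @Override
  public RelDataType deriveSumType(RelDataTypeFactory typeFactory,
                                   RelDataType argumentType) {
    // Calcite's default keeps SUM(INT) as INT; widen integer inputs to BIGINT
    // so the type resolved during planning matches Drill's execution output.
    if (SqlTypeName.INT_TYPES.contains(argumentType.getSqlTypeName())) {
      return typeFactory.createTypeWithNullability(
          typeFactory.createSqlType(SqlTypeName.BIGINT),
          argumentType.isNullable());
    }
    return super.deriveSumType(typeFactory, argumentType);
  }
}

With something along these lines plugged into validation, sum(int) would
already resolve to bigint at planning time, so a "limit 0" answered from the
plan alone would agree with a full execution.
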
> >> > I'd guess that the top issue is for Parquet and Hive. If that is the
> >> > case, let's just start treating them as schema'd all the way through.
> >> > If people are begging for fast schema on JSON, let's take the stuff
> >> > for Parquet and Hive and leverage it via direct sampling at planning
> >> > time for the non-schema'd formats.
> >> >
> >> > --
> >> > Jacques Nadeau
> >> > CTO and Co-Founder, Dremio
> >> >
> >> > On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]>
> >> > wrote:
> >> >
> >> > > For (3), are you referring to the operators that extend
> >> > > AbstractSingleRecordBatch? We basically only call the buildSchema()
> >> > > method on blocking operators. If the operators are not blocking, we
> >> > > simply process the first batch, the idea being that it should be
> >> > > fast enough. Are there situations where this is not true? If we are
> >> > > skipping empty batches, that could cause a delay in schema
> >> > > propagation, but we can handle that case with special handling for
> >> > > the first batch.
> >> > >
> >> > > As for (4), it's really historical. We originally didn't have fast
> >> > > schema, and when it was added, only the minimal code changes
> >> > > necessary to make it work were done. At the time the fast schema
> >> > > feature was implemented, there was just the "setup" method of the
> >> > > operators, which handled both materializing the output batch and
> >> > > generating the code. It would require additional work, and
> >> > > potentially add code complexity, to further separate the parts of
> >> > > setup that are needed for fast schema from those that are not. And
> >> > > I'm not sure how much benefit we would get from it.
> >> > >
> >> > > What is the motivation behind this? In other words, what sort of
> >> > > delays are you currently seeing? And have you done an analysis of
> >> > > what is causing the delay? I would think that code generation would
> >> > > cause only a minimal delay, unless we are concerned about cutting
> >> > > the time for "limit 0" queries down to just a few milliseconds.
> >> > >
> >> > > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hey y’all,
> >> > > >
> >> > > > @Jacques and @Steven,
> >> > > >
> >> > > > I am looking at improving the fast schema path (for LIMIT 0
> >> > > > queries). It seems to me that on the first call to next (the
> >> > > > buildSchema call), in any operator, only two tasks need to be
> >> > > > done:
> >> > > > 1) call next exactly once on each of the incoming batches, and
> >> > > > 2) set up the output container based on those incoming batches
> >> > > >
> >> > > > However, looking at the implementation, some record batches:
> >> > > > 3) make multiple calls to incoming batches (with a comment “skip
> >> > > > first batch if count is zero, as it may be an empty schema
> >> > > > batch”), and
> >> > > > 4) generate code, etc.
> >> > > >
> >> > > > Any reason why (1) and (2) aren’t sufficient? Any optimizations
> >> > > > that were considered but not implemented?
> >> > > >
> >> > > > Thank you,
> >> > > > Sudheesh
> >> > >
> >> >
> >>
>
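
And to make Sudheesh's two-step schema pass concrete, here is a rough sketch
of (1) and (2). The types below are hypothetical stand-ins for illustration,
not Drill's actual RecordBatch or VectorContainer classes:

// Hypothetical stand-ins for illustration; not Drill's actual operator APIs.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

interface Upstream {
  /** Advances to the next batch; the first call materializes the schema. */
  void next();

  /** Column name -> type name of the current batch. */
  Map<String, String> schema();
}

final class SchemaOnlyPass {

  /**
   * The two steps from the mail above: (1) call next() exactly once on each
   * incoming so its schema is known, then (2) assemble the outgoing schema
   * from those incomings. No code generation and no data processing here.
   */
  static Map<String, String> buildOutputSchema(List<Upstream> incomings) {
    Map<String, String> outgoing = new LinkedHashMap<>();
    for (Upstream incoming : incomings) {
      incoming.next();                    // step (1): exactly one call
      outgoing.putAll(incoming.schema()); // step (2): carry the columns through
    }
    return outgoing;
  }
}

Everything else (extra calls on the incoming, code generation) would be
deferred until real data flows.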
