I agree. That makes total sense from a conceptual standpoint. What's needed to do this? Is the framework in place for Drill to do this?
-- Zelaine

On Thu, Nov 5, 2015 at 1:51 PM, Parth Chandra <[email protected]> wrote:

> I like the idea of making Parquet/Hive schema'd and returning the
> schema at planning time. Front-end tools assume that the backend can
> do a Prepare and then Execute, and this fits that model much better.
>
> On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <[email protected]> wrote:
>
> > The only way we get to a few milliseconds is by doing this stuff at
> > planning. Let's start by making Parquet schema'd and fixing our
> > implicit cast rules. Once completed, we can return schema through
> > planning alone and completely skip over execution code (as in every
> > other database).
> >
> > I'd guess that the top issue is for Parquet and Hive. If that is
> > the case, let's just start treating them as schema'd all the way
> > through. If people are begging for fast schema on JSON, let's take
> > the stuff for Parquet and Hive and leverage it via direct sampling
> > at planning time for the non-schema'd formats.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]> wrote:
> >
> > > For (3), are you referring to the operators that extend
> > > AbstractSingleRecordBatch? We basically only call the
> > > buildSchema() method on blocking operators. If the operators are
> > > not blocking, we simply process the first batch, the idea being
> > > that it should be fast enough. Are there situations where this is
> > > not true? If we are skipping empty batches, that could cause a
> > > delay in schema propagation, but we can handle that case by
> > > having special handling for the first batch.
> > >
> > > As for (4), it's really historical. We originally didn't have
> > > fast schema, and when it was added, only the minimal code changes
> > > necessary to make it work were made. At the time the fast schema
> > > feature was implemented, there was just the "setup" method of the
> > > operators, which handled both materializing the output batch and
> > > generating the code. It would require additional work, and
> > > potentially add code complexity, to further separate the parts of
> > > setup that are needed for fast schema from those which are not.
> > > And I'm not sure how much benefit we would get from it.
> > >
> > > What is the motivation behind this? In other words, what sort of
> > > delays are you currently seeing? And have you done an analysis of
> > > what is causing the delay? I would think that code generation
> > > would cause only a minimal delay, unless we are concerned about
> > > cutting the time for "limit 0" queries down to just a few
> > > milliseconds.
> > >
> > > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]> wrote:
> > >
> > > > Hey y'all,
> > > >
> > > > @Jacques and @Steven,
> > > >
> > > > I am looking at improving the fast schema path (for LIMIT 0
> > > > queries). It seems to me that on the first call to next (the
> > > > buildSchema call), in any operator, only two tasks need to be
> > > > done:
> > > > 1) call next exactly once on each of the incoming batches, and
> > > > 2) set up the output container based on those incoming batches
> > > >
> > > > However, looking at the implementation, some record batches:
> > > > 3) make multiple calls to incoming batches (with a comment
> > > > "skip first batch if count is zero, as it may be an empty
> > > > schema batch"), and
> > > > 4) generate code, etc.
> > > >
> > > > Any reason why (1) and (2) aren't sufficient? Any optimizations
> > > > that were considered, but not implemented?
> > > >
> > > > Thank you,
> > > > Sudheesh
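
For reference, a minimal, self-contained Java sketch of the two-step
first-next contract described in Sudheesh's points (1) and (2). This
is a simplified model, not Drill code: Batch, IterOutcome, and
PassThroughOperator below are illustrative stand-ins (Drill's real
RecordBatch and IterOutcome carry far more state, and real operators
transfer value vectors rather than copying a list of column names).

    import java.util.Arrays;
    import java.util.List;

    public class FastSchemaSketch {

      enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

      // Stand-in for a record batch: a schema plus (elided) data.
      interface Batch {
        IterOutcome next();
        List<String> schema(); // simplified: column names only
      }

      // Hypothetical operator doing only steps (1) and (2) on its
      // first next() call.
      static final class PassThroughOperator implements Batch {
        private final Batch incoming;
        private List<String> outputSchema;
        private boolean first = true;

        PassThroughOperator(Batch incoming) {
          this.incoming = incoming;
        }

        @Override
        public IterOutcome next() {
          if (first) {
            first = false;
            incoming.next();                  // (1) exactly one call downstream
            outputSchema = incoming.schema(); // (2) set up the output "container"
            // No code generation here; that can wait for the second
            // next(), which a LIMIT 0 query never issues.
            return IterOutcome.OK_NEW_SCHEMA;
          }
          return IterOutcome.NONE; // real data processing would go here
        }

        @Override
        public List<String> schema() {
          return outputSchema;
        }
      }

      public static void main(String[] args) {
        // A fake scan that reports a two-column schema on first call.
        Batch scan = new Batch() {
          @Override
          public IterOutcome next() {
            return IterOutcome.OK_NEW_SCHEMA;
          }

          @Override
          public List<String> schema() {
            return Arrays.asList("a", "b");
          }
        };

        Batch op = new PassThroughOperator(scan);
        System.out.println(op.next());   // OK_NEW_SCHEMA
        System.out.println(op.schema()); // [a, b]
      }
    }

Under this contract, a LIMIT 0 query drives the pipeline through only
the first call to next(), which is why deferring code generation past
that point (Sudheesh's point 4) is the optimization being weighed.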
