I like the idea of making Parquet/Hive schema'd and returning the schema at planning time. Front-end tools assume that the backend can do a Prepare and then an Execute, and this fits that model much better.
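To make the Prepare/Execute point concrete, here is a minimal sketch in Java. All names (FastSchemaSketch, planSchema, prepare) are illustrative stand-ins, not Drill APIs: the idea is simply that for schema'd sources, Prepare can answer a LIMIT 0 query entirely from planner-known schema, with no execution at all.

```java
// Illustrative sketch, not Drill code: FastSchemaSketch, planSchema, and
// prepare are made-up names for this example.
import java.util.Arrays;
import java.util.List;

public class FastSchemaSketch {

    // Schema known at planning time; a real engine would read it from
    // Parquet footers or the Hive metastore rather than hard-coding it.
    static List<String> planSchema(String table) {
        return Arrays.asList("id:INT", "name:VARCHAR");
    }

    // Prepare resolves the result schema at planning time only; Execute
    // is never involved for a LIMIT 0 query against a schema'd source.
    static List<String> prepare(String sql) {
        return planSchema("t");
    }

    public static void main(String[] args) {
        System.out.println(prepare("SELECT id, name FROM t LIMIT 0"));
        // prints: [id:INT, name:VARCHAR]
    }
}
```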
On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <[email protected]> wrote:

> The only way we get to a few milliseconds is by doing this stuff at
> planning. Let's start by making Parquet schema'd and fixing our implicit
> cast rules. Once completed, we can return schema just through planning and
> completely skip over execution code (as in every other database).
>
> I'd guess that the top issue is for Parquet and Hive. If that is the case,
> let's just start treating them as schema'd all the way through. If people
> are begging for fast schema on JSON, let's take the stuff for Parquet and
> Hive and leverage it via direct sampling at planning time for the
> non-schema'd formats.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <[email protected]> wrote:
>
> > For (3), are you referring to the operators that extend
> > AbstractSingleRecordBatch? We basically only call the buildSchema()
> > method on blocking operators. If the operators are not blocking, we
> > simply process the first batch, the idea being that it should be fast
> > enough. Are there situations where this is not true? If we are skipping
> > empty batches, that could cause a delay in the schema propagation, but
> > we can handle that case by having special handling for the first batch.
> >
> > As for (4), it's really historical. We originally didn't have fast
> > schema, and when it was added, only the minimal code changes necessary
> > to make it work were done. At the time the fast schema feature was
> > implemented, there was just the "setup" method of the operators, which
> > handled both materializing the output batch and generating the code. It
> > would require additional work, as well as potentially added code
> > complexity, to further separate the parts of setup that are needed for
> > fast schema from those which are not. And I'm not sure how much benefit
> > we would get from it.
> >
> > What is the motivation behind this? In other words, what sort of delays
> > are you currently seeing? And have you done an analysis of what is
> > causing the delay? I would think that code generation would cause only
> > a minimal delay, unless we are concerned about cutting the time for
> > "limit 0" queries down to just a few milliseconds.
> >
> > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <[email protected]> wrote:
> >
> > > Hey y'all,
> > >
> > > @Jacques and @Steven,
> > >
> > > I am looking at improving the fast schema path (for LIMIT 0 queries).
> > > It seems to me that on the first call to next (the buildSchema call),
> > > in any operator, only two tasks need to be done:
> > > 1) call next exactly once on each of the incoming batches, and
> > > 2) set up the output container based on those incoming batches
> > >
> > > However, looking at the implementation, some record batches:
> > > 3) make multiple calls to incoming batches (with a comment "skip first
> > > batch if count is zero, as it may be an empty schema batch"),
> > > 4) generate code, etc.
> > >
> > > Any reason why (1) and (2) aren't sufficient? Any optimizations that
> > > were considered, but not implemented?
> > >
> > > Thank you,
> > > Sudheesh
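For reference, steps (1) and (2) from Sudheesh's list, plus the empty-schema-batch skip mentioned in (3), can be sketched roughly as below. Batch, firstDataBatch, and buildSchema are illustrative stand-ins for this example, not Drill's actual RecordBatch classes.

```java
// Hedged sketch, not Drill code: a "batch" here is reduced to its schema
// plus a row count, just enough to show the first-next() schema setup.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BuildSchemaSketch {

    static class Batch {
        final List<String> schema;
        final int rowCount;
        Batch(List<String> schema, int rowCount) {
            this.schema = schema;
            this.rowCount = rowCount;
        }
    }

    // Step (1), with the special case from (3): pull the first batch from
    // the incoming, skipping a leading schema-only batch if it is empty
    // ("skip first batch if count is zero").
    static Batch firstDataBatch(Iterator<Batch> incoming) {
        Batch b = incoming.next();
        if (b.rowCount == 0 && incoming.hasNext()) {
            b = incoming.next();
        }
        return b;
    }

    // Step (2): set up an empty output container carrying the incoming
    // schema; no code generation is needed just to report the schema.
    static Batch buildSchema(Iterator<Batch> incoming) {
        Batch first = firstDataBatch(incoming);
        return new Batch(new ArrayList<>(first.schema), 0);
    }

    public static void main(String[] args) {
        Iterator<Batch> in = Arrays.asList(
                new Batch(Arrays.asList("id:INT"), 0),    // empty schema batch
                new Batch(Arrays.asList("id:INT"), 4096)  // first data batch
        ).iterator();
        Batch out = buildSchema(in);
        System.out.println(out.schema + " rows=" + out.rowCount);
        // prints: [id:INT] rows=0
    }
}
```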
