Hi Gautam,
One follow-up: I realize one point in my previous note was a bit unclear.
I suggested that stats be gathered by a query. By this, I simply mean that
stats use the existing query mechanism with, perhaps, an enhanced UDAF
(user-defined aggregate function) API.
Your work includes a new SQL statement to gather stats: COMPUTE STATS ON
<table> or ANALYZE TABLE <table> ... (I don't recall the specifics.) My
suggestion is not about that statement itself, but rather about how that
statement is implemented. The current code, if I recall correctly, introduces a
new execution model that gathers stats and writes a file.
The specific suggestion is to keep the new ANALYZE TABLE ... syntax, but to
implement it as an extension to a CTAS statement: run a query that computes the
stats, then write the result to some file structure. Said another way,
ANALYZE TABLE is simply an alias for (or is rewritten to) a stats query plus
output to Drill's preferred internal format.
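For example, here is a rough sketch. The ANALYZE syntax, the stats function
names, and the output location are all illustrative, not the actual design:

    -- What the user writes (whatever the final syntax turns out to be):
    ANALYZE TABLE dfs.`/data/orders` COMPUTE STATISTICS;

    -- What the planner could rewrite it to, CTAS-style (the dfs.stats
    -- workspace and ndv() function are hypothetical stand-ins):
    CREATE TABLE dfs.stats.`/data/orders` AS
    SELECT 'order_id' AS column_name,
           ndv(order_id) AS ndv,
           count(order_id) AS non_null_count
    FROM dfs.`/data/orders`;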
This model allows others to use queries to compute stats for the reasons
discussed previously.
I hope this clarifies things a bit...
Also, to be clear, I'm just tossing out ideas to make Drill as useful as
possible. Whatever we do, it would be good to get the existing version into the
code base so folks can play with it.
Thanks,
- Paul
On Thursday, November 8, 2018, 3:57:35 PM PST, Paul Rogers
<[email protected]> wrote:
Hi Gautam,
Thanks much for the explanations. You raise some interesting points. I noticed
that Boaz has just filed a JIRA ticket to tackle the inefficient count distinct
case.
To take a step back, recall that Arina is working on a metadata proposal. A key
aspect of that proposal is that it provides an API so that Drill can connect
(via an adapter) to any metadata system. The gist of my comments is that it
would be wonderful if stats could work the same way. Provide a generic way to
compute stats (as a query). That data can then be written to whatever metadata
system the user wants to use, and served back via Arina's API.
Here is a little ASCII diagram which, I hope, survives e-mail:
Stats query --> custom metastore --> Metadata API --> Drill planner
That first arrow represents a query someone runs that gathers stats, along with
code to write the results into the custom metastore. Very powerful concept.
Detailed comments below.
> a) This approach offloads the hard task of figuring out which statistics are
> needed for which columns based on the user workload and then adapting to
> changes in the workload! This may be useful for experimenting, but not in
> practice.
The point here was a bit more subtle. First, users (and QA) want to know the
values of stats. Stats-as-query lets people easily play with the stats. I may
request SELECT ndv(id) FROM ... because I want to know the NDV. The stats
gathering mechanism may want NDV for all columns, along with other material.
Second, the stats gathering logic will know what stats are wanted for that
specific purpose. Maybe in the Drill native version, Drill guesses which
columns should have stats.
But, savvy users will know ahead of time (or from experience) which columns
justify the cost of stats. If I roll my own metadata system, I may know that it
is worth gathering stats on a small subset of columns, so I'll issue the query
to do so.
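For instance, if experience tells me only two columns ever appear in my
filters, I might run just this (ndv() again standing in for whatever the
aggregate ends up being called):

    SELECT ndv(cust_id) AS cust_ndv,
           ndv(order_date) AS date_ndv
    FROM dfs.`/data/orders`;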
The key thought is, separate out the mechanism to compute stats from the
consumption of stats. (BTW, Impala uses a plain query to compute its stats.)
> b) Please note this approach would also require making additional
> Calcite/Drill code changes to generate the correct plans (serial and
> parallel). Currently, we bypass these changes and directly generate the
> physical plan. ...
Excellent. The result would be the ability for Drill to generate efficient
plans for complex stats, whether those stats are requested by the user or by
Drill itself. Stats are easier to add: just add the necessary (probably
complex) functions and the planner takes care of the rest. There are not two
distinct code paths to reason about and maintain.
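As a bonus, the usual tooling then applies: anyone can inspect how Drill plans
and parallelizes a stats query with the existing EXPLAIN support (ndv() once
more being a stand-in):

    EXPLAIN PLAN FOR
    SELECT ndv(cust_id) FROM dfs.`/data/orders`;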
> c) Exposing all such statistics may not be useful for the users e.g. for NDV
> we save the HLL structure which allows us to parallelize the plans and
> compute NDV efficiently and in the future will allow us to compute
> partition-wise statistics.
Stats (histograms, NDV) are complex to compute. Perfect: they force us to
extend our aggregate UDF functionality so it is powerful enough that community
members can compute aggregations just as complex as stats. We kill two birds
with one stone.
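To make that concrete, here is roughly how a parallel, partition-wise NDV might
look if the HLL pieces were exposed as aggregate functions (hll(), hll_merge(),
and hll_estimate() are all hypothetical names):

    -- Each branch computes an HLL sketch over one partition; a merge
    -- phase combines the sketches without rescanning the data.
    SELECT hll_estimate(hll_merge(sketch)) AS ndv
    FROM (
      SELECT hll(cust_id) AS sketch FROM dfs.`/data/orders_2017`
      UNION ALL
      SELECT hll(cust_id) AS sketch FROM dfs.`/data/orders_2018`
    );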
> d) Consider, in the future, we add Histograms. There are several different
> kinds of histograms with trade-offs and we may decide to generate one or
> the other based on `ndv` values etc. We cannot expect the user to figure
> all this out on their own.
True. But, imagine how much easier it will be to explain to savvy users or
Drill developers if you can say: to see the difference between different
histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10
buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is, if
we encapsulate these different types of histograms in a (complex) aggregate
function, Drill can keep adding new ones as easily as adding an aggregate UDF.
If things are tunable (the user gets to choose the histogram style and size,
say), then being able to visualize the results will help the user choose wisely.
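For example (using the hypothetical histogram functions and bucket-count
parameter from above):

    -- Compare two styles, and two bucket counts, side by side:
    SELECT histogram1(price, 10) AS style1_10buckets,
           histogram1(price, 20) AS style1_20buckets,
           histogram2(price, 10) AS style2_10buckets
    FROM dfs.`/data/orders`;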
Further, in training (or the next edition of the Drill book), we can explain
how stats work by having people play with queries that provide the information.
> e) Having said all that, the current implementation does do what you are
> asking for (albeit not for the parallel version) which may be useful for
> experimentation.
Very cool. Just need the parallel version and we'll be in great shape.
> Yes, the current approach of storing statistics is emulated from Parquet
> metadata. Even though it is riddled with concurrency issues, it does not do
> any worse. Hence, I would contend that it is a good starting point. Once
> the meta-store work is complete I plan to integrate statistics with it and
> leverage all the great benefits that come with this approach.
This is my very point. We are basing the stats implementation on a broken
storage format. The suggestion here is to split the problem.
Discussion about metadata pointed out the benefit of separating the use of
schemas, say, from their storage. Arina proposes an API that can be implemented
in a variety of ways. Stats are similar, though inverted. Provide an API (a
query) that computes stats; community users can store those stats in a variety
of ways. That extension then feeds the data back to Drill via the metadata API.
Again, the key point is: separate the stats computation mechanism (which is
just a query with sophisticated aggregates) from stats storage (which needs to
be pluggable).
Then, the first out-of-the-box implementation might be based on the Parquet
file mechanism, but it is just one implementation of possibly many.
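Even today, a user can steer CTAS output to a different format or location, so
stats computed by a query inherit that flexibility for free (dfs.tmp here is
just a stand-in for wherever the stats should live):

    -- Write the same stats as JSON rather than Parquet:
    ALTER SESSION SET `store.format` = 'json';
    CREATE TABLE dfs.tmp.`orders_stats` AS
    SELECT ndv(cust_id) AS cust_ndv
    FROM dfs.`/data/orders`;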
You may point out a very reasonable fact: that the code path you describe
already exists, and so it is less costly, today, to just use that path. You
could reasonably argue that you are not in a position to start over. It is
still very useful to play
with what we have. But, let's also keep in mind where Drill wants to go: that
it is unique in its modularity and its ability to integrate with a wide variety
of systems.
Thanks,
- Paul