Hi Gautam,
One follow-up: I realize one point in my previous note was a bit unclear.
I suggested that stats be gathered by a query. By this, I simply mean that
stats use the existing query mechanism with, perhaps, an enhanced UDAF
(user-defined aggregate function) API.
Your work includes a new SQL statement to gather stats: COMPUTE STATS ON
<table> or ANALYZE TABLE <table> ... (I don't recall the specifics.) My
suggestion is not about that statement itself, but rather about how that
statement is implemented. The current code, if I recall correctly, introduces a
new execution model that gathers stats and writes a file.
The specific suggestion is to keep the new ANALYZE TABLE ... syntax, but to
implement it as an extension to a CTAS statement: run a query that computes the
stats, then write the result to some file structure. Said another way,
ANALYZE TABLE is simply an alias for (or is rewritten to) a stats query plus
output to Drill's preferred internal format.
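For example, here is a rough sketch. The ANALYZE syntax, the stats function
names, and the output location are all illustrative, not the actual design:

    -- What the user writes (whatever the final syntax turns out to be):
    ANALYZE TABLE dfs.`/data/orders` COMPUTE STATISTICS;

    -- What the planner could rewrite it to, CTAS-style (the dfs.stats
    -- workspace and ndv() function are hypothetical stand-ins):
    CREATE TABLE dfs.stats.`/data/orders` AS
    SELECT 'order_id' AS column_name,
           ndv(order_id) AS ndv,
           count(order_id) AS non_null_count
    FROM dfs.`/data/orders`;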
This model allows others to use queries to compute stats for the reasons
discussed previously.
I hope this clarifies things a bit...
Also, to be clear, I'm just tossing out ideas to make Drill as useful as
possible. Whatever we do, it would be good to get the existing version into the
code base so folks can play with it.
Thanks,
- Paul
On Thursday, November 8, 2018, 3:57:35 PM PST, Paul Rogers
<[email protected]> wrote:
Hi Gautam,
Thanks much for the explanations. You raise some interesting points. I noticed
that Boaz has just filed a JIRA ticket to tackle the inefficient count distinct
case.
To take a step back, recall that Arina is working on a metadata proposal. A key
aspect of that proposal is that it provides an API so that Drill can connect
(via an adapter) to any metadata system. The gist of my comments is that it
would be wonderful if stats could work the same way. Provide a generic way to
compute stats (as a query). That data can then be written to whatever metadata
system the user wants to use, and served back via Arina's API.
Here is a little ASCII diagram which, I hope, survives e-mail:
Stats query --> custom metastore --> Metadata API --> Drill planner
That first arrow represents a query someone runs that gathers stats, along with
code to write the results into the custom metastore. Very powerful concept.
Detailed comments below.
> a) This approach offloads the hard task of figuring out which statistics are
> needed for which columns based on the user workload and then adapting to
> changes in the workload! This may be useful for experimenting, but not in
> practice.
The point here was a bit more subtle. First, users (and QA) want to know the
values of stats. Stats-as-query lets people easily play with the stats. I may
request SELECT ndv(id) FROM ... because I want to know the NDV. The stats
gathering mechanism may want NDV for all columns, along with other material.
Second, the stats gathering logic will know what stats are wanted for that
specific purpose. Maybe in the Drill native version, Drill guesses which
columns should have stats.
But, savvy users will know ahead of time (or from experience) which columns
justify the cost of stats. If I roll my own metadata system, I may know that it
is worth gathering stats on a small subset of columns, so I'll issue the query
to do so.
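For instance, if experience tells me only two columns ever appear in my
filters, I might run just this (ndv() again standing in for whatever the
aggregate ends up being called):

    SELECT ndv(cust_id) AS cust_ndv,
           ndv(order_date) AS date_ndv
    FROM dfs.`/data/orders`;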
The key thought is, separate out the mechanism to compute stats from the
consumption of stats. (BTW, Impala uses a plain query to compute its stats.)
> b) Please note this approach would also require making additional
> Calcite/Drill code changes to generate the correct plans (serial and
> parallel). Currently, we bypass these changes and directly generate the
> physical plan. ...
Excellent. The result would be the ability for Drill to generate efficient
plans for complex stats, whether those stats are requested by the user or by
Drill itself. Stats are easier to add: just add the necessary (probably
complex) functions and the planner takes care of the rest. There are not two
distinct code paths to reason about and maintain.
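As a bonus, the usual tooling then applies: anyone can inspect how Drill plans
and parallelizes a stats query with the existing EXPLAIN support (ndv() once
more being a stand-in):

    EXPLAIN PLAN FOR
    SELECT ndv(cust_id) FROM dfs.`/data/orders`;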
> c) Exposing all such statistics may not be useful for the users e.g. for NDV
> we save the HLL structure which allows us to parallelize the plans and
> compute NDV efficiently and in the future will allow us to compute
> partition-wise statistics.
Stats (histograms, NDV) are complex to compute. Perfect: they force us to
extend our aggregate UDF functionality so it is powerful enough that community
members can compute aggregations just as complex as stats. We kill two birds
with one stone.
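To make that concrete, here is roughly how a parallel, partition-wise NDV might
look if the HLL pieces were exposed as aggregate functions (hll(), hll_merge(),
and hll_estimate() are all hypothetical names):

    -- Each branch computes an HLL sketch over one partition; a merge
    -- phase combines the sketches without rescanning the data.
    SELECT hll_estimate(hll_merge(sketch)) AS ndv
    FROM (
      SELECT hll(cust_id) AS sketch FROM dfs.`/data/orders_2017`
      UNION ALL
      SELECT hll(cust_id) AS sketch FROM dfs.`/data/orders_2018`
    );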
> d) Consider, in the future, we add Histograms. There are several different
> kinds of histograms with trade-offs and we may decide to generate one or
> the other based on `ndv` values etc. We cannot expect the user to figure
> all this out on their own.
True. But, imagine how much easier it will be to explain to savvy users or
Drill developers if you can say: to see the difference between different
histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10
buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is, if
we encapsulate these different types of histograms in a (complex) aggregate
function, Drill can keep adding new ones as easily as adding an aggregate UDF.
If things are tunable (the user gets to choose the histogram style and size,
say), then being able to visualize the results will help the user choose wisely.
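For example (using the hypothetical histogram functions and bucket-count
parameter from above):

    -- Compare two styles, and two bucket counts, side by side:
    SELECT histogram1(price, 10) AS style1_10buckets,
           histogram1(price, 20) AS style1_20buckets,
           histogram2(price, 10) AS style2_10buckets
    FROM dfs.`/data/orders`;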
Further, in training (or the next edition of the Drill book), we can explain
how stats work by having people play with queries that provide the information.
> e) Having said all that, the current implementation does do what you are
> asking for (albeit not for the parallel version) which may be useful for
> experimentation.
Very cool. Just need the parallel version and we'll be in great shape.
> Yes, the current approach of storing statistics is emulated from Parquet
> metadata. Even though it is riddled with concurrency issues, it does not do
> any worse. Hence, I would contend that it is a good starting point. Once
> the meta-store work is complete I plan to integrate statistics with it and
> leverage all the great benefits that come with this approach.
This is my very point. We are basing the stats implementation on a broken
storage format. The suggestion here is to split the problem.
Discussion about metadata pointed out the benefit of separating the use of
schemas, say, from their storage. Arina proposes an API that can be implemented
in a variety of ways. Stats are similar, though inverted. Provide an API (a
query) that computes stats; community users can store those stats in a variety
of ways. That extension then feeds the data back to Drill via the metadata API.
Again, the key point is: separate the stats computation mechanism (which is
just a query with sophisticated aggregates) from stats storage (which needs to
be pluggable).
Then, the first out-of-the-box implementation might be based on the Parquet
file mechanism, but it is just one implementation of possibly many.
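Even today, a user can steer CTAS output to a different format or location, so
stats computed by a query inherit that flexibility for free (dfs.tmp here is
just a stand-in for wherever the stats should live):

    -- Write the same stats as JSON rather than Parquet:
    ALTER SESSION SET `store.format` = 'json';
    CREATE TABLE dfs.tmp.`orders_stats` AS
    SELECT ndv(cust_id) AS cust_ndv
    FROM dfs.`/data/orders`;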
You may point out a very reasonable fact: that the code path you describe
already exists, and so it is less costly, today, to just use that path. You
could reasonably argue that you are not in a position to start over. It is
still very useful to play
with what we have. But, let's also keep in mind where Drill wants to go: that
it is unique in its modularity and its ability to integrate with a wide variety
of systems.
Thanks,
- Paul