Hello all, 
First, I want to thank everyone for all the work that has gone into Drill
1.20, as well as the ongoing discussion around Drill 2.0. I'd like to start a
discussion about a topic for Drill 1.21: INFO_SCHEMA improvements.
As my company wades further and further into Drill, it has become apparent that
the INFO_SCHEMA could use some attention. James Turton submitted a PR which
was merged into Drill 1.20, but in so doing he uncovered an entire Pandora's
box of other issues which might be worth addressing. In a nutshell, the issues
with the INFO_SCHEMA are all performance-related: it can be very slow and can
also consume significant resources when executing even basic queries.

My understanding of how the info schema (IS) works is that when a user executes
an IS query, Drill will attempt to instantiate every enabled storage plugin to
discover schemata and other information. As you might imagine, this can be
costly.

So (and again, this is only meant as a conversation starter), here are some
general ideas for how we might improve the IS:
1.  Implement a limit pushdown:  As far as I can tell, there is no limit 
pushdown in the IS and this could be a relatively quick win for improving IS 
query performance.
2.  Caching:  I understand that caching is tricky, but perhaps we could add 
some sort of schema caching for IS queries, or make better use of the Drill 
metastore to reduce the number of connections during IS queries.  Perhaps in 
combination with the metastore, we could implement some sort of "metastore 
first" plan, whereby Drill first hits the metastore for query results and if 
the limit is reached, we're done.  If not, query the storage plugins...
3.  Parallelization:  It did not appear to me that Drill parallelizes IS
queries.  We may be able to add some parallelization, which would improve
overall speed but not necessarily reduce overall compute cost.
4.  Convert to EVF2:  Not sure that there's a performance benefit here, but at
least we could get rid of cruft.
5.  Reduce SerDe:  The IS seems to obtain a POJO from the storage plugin and
then write these results to old-school Drill vectors.  There may have been a
good reason it was done this way (or maybe not), but I have to wonder if there
is a more efficient way of obtaining the information from the storage plugin,
ideally w/o all the object creation.
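
To make ideas 1 and 2 a bit more concrete, here is a rough sketch of the
"metastore first" plan combined with a limit. It is plain Python rather than
Drill code, and everything in it (SchemaSource, info_schema_tables, the
"opened" flag) is a hypothetical stand-in I made up for illustration, not an
existing Drill API. The only point is that sources are consulted lazily, so a
plugin is never touched once the limit has been satisfied:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


class SchemaSource:
    """Hypothetical stand-in for the metastore or a storage plugin (not a Drill API)."""

    def __init__(self, name: str, fetch: Callable[[], Iterable[str]]) -> None:
        self.name = name
        self._fetch = fetch
        self.opened = False  # becomes True once we pay the connection cost

    def tables(self) -> Iterator[str]:
        # Generator body: nothing here runs until the first row is requested.
        self.opened = True
        for table in self._fetch():
            yield f"{self.name}.{table}"


def info_schema_tables(
    metastore: SchemaSource, plugins: Sequence[SchemaSource], limit: int
) -> List[str]:
    """Return at most `limit` rows, asking the metastore before any plugin."""

    def rows() -> Iterator[str]:
        # Sources are visited in order and only as far as the consumer pulls.
        for source in (metastore, *plugins):
            yield from source.tables()

    # islice stops pulling after `limit` rows, which acts as the limit pushdown.
    return list(islice(rows(), limit))


if __name__ == "__main__":
    metastore = SchemaSource("metastore", lambda: ["orders", "users"])
    slow_plugin = SchemaSource("jdbc", lambda: ["events"])
    print(info_schema_tables(metastore, [slow_plugin], limit=2))
    print(slow_plugin.opened)  # False: the plugin was never contacted
```

With a limit of 2 the metastore alone answers the query and the plugin is
never opened; with a limit of 3 the plugin is opened only after the metastore
rows run out. Whether Drill's planner could be made to behave this way is, of
course, exactly the open question.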

These are just some thoughts, and I'm curious as to what the community thinks 
about this.  Thanks everyone!
-- C
