Before we discuss the next release, I would like to point out that an Apache 
project should not be directly linked to a commercial company; otherwise, the 
community's motivation to contribute will suffer.

Thanks.

> On Feb 6, 2022, at 21:29, Charles Givre <[email protected]> wrote:
> 
> Hello all, 
> Firstly, I wanted to thank everyone for all the work that has gone into Drill 
> 1.20 as well as the ongoing discussion around Drill 2.0.  I wanted to start 
> a discussion around a topic for Drill 1.21, and that is INFO_SCHEMA 
> improvements.  As my company wades further and further into Drill, it has 
> become apparent that the INFO_SCHEMA could use some attention.  James Turton 
> submitted a PR which was merged into Drill 1.20, but in so doing he uncovered 
> an entire Pandora's box of other issues which might be worth addressing.  In 
> a nutshell, the issues with the INFO_SCHEMA are all performance related: it 
> can be very slow and also can consume significant resources when executing 
> even basic queries.  
> 
> My understanding of how the info schema (IS) works is that when a user 
> executes a query, Drill will attempt to instantiate every enabled storage 
> plugin to discover schemata and other information. As you might imagine, this 
> can be costly. 
> 
> So, (and again, this is only meant as a conversation starter), I was thinking 
> there are some general ideas as to how we might improve the IS:
> 1.  Implement a limit pushdown:  As far as I can tell, there is no limit 
> pushdown in the IS and this could be a relatively quick win for improving IS 
> query performance.
> 2.  Caching:  I understand that caching is tricky, but perhaps we could add 
> some sort of schema caching for IS queries, or make better use of the Drill 
> metastore to reduce the number of connections during IS queries.  Perhaps in 
> combination with the metastore, we could implement some sort of "metastore 
> first" plan, whereby Drill first hits the metastore for query results and if 
> the limit is reached, we're done.  If not, query the storage plugins...
> 3.  Parallelization:  It did not appear to me that Drill parallelizes IS 
> queries.  We may be able to add some parallelization, which would improve 
> overall speed but not necessarily reduce overall compute cost.
> 4.  Convert to EVF2:  Not sure that there's a performance benefit here, but 
> at least we could get rid of cruft.
> 5.  Reduce SerDe:  The IS seems to obtain a POJO from the storage plugin and 
> then write these results to old-school Drill vectors.  There may have been a 
> good reason it was done this way (or maybe not), but I have to wonder if 
> there is a more efficient way of obtaining the information from the storage 
> plugin, ideally w/o all the object creation. 
> 
> These are just some thoughts, and I'm curious as to what the community thinks 
> about this.  Thanks everyone!
> -- C
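
As a rough illustration of ideas 1 and 2 in the quoted message (limit pushdown combined with a "metastore first" lookup), here is a minimal Python sketch. It is not Drill code and does not reflect how Drill's INFO_SCHEMA is actually implemented: `info_schema_rows`, `query_schemata`, and the idea of a plugin as a plain callable are all hypothetical names chosen for the example. The cached list stands in for whatever a metastore or schema cache would return.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for a storage plugin: calling it pays the
# connection cost and returns the schema names that plugin exposes.
PluginLoader = Callable[[], List[str]]


def info_schema_rows(
    cached: Iterable[str],
    plugins: Dict[str, PluginLoader],
) -> Iterator[str]:
    """Yield schema names lazily: cached entries first, then live plugins."""
    seen = set()
    for name in cached:
        seen.add(name)
        yield name
    # A plugin is only contacted if the consumer keeps asking for rows.
    for load in plugins.values():
        for name in load():
            if name not in seen:
                seen.add(name)
                yield name


def query_schemata(
    cached: Iterable[str],
    plugins: Dict[str, PluginLoader],
    limit: int,
) -> List[str]:
    """Apply the LIMIT while iterating, so unneeded plugins are never loaded."""
    return list(islice(info_schema_rows(cached, plugins), limit))


if __name__ == "__main__":
    contacted: List[str] = []

    def make_loader(name: str, schemas: List[str]) -> PluginLoader:
        def load() -> List[str]:
            contacted.append(name)  # record which plugins were really hit
            return schemas
        return load

    plugins = {
        "dfs": make_loader("dfs", ["dfs.tmp", "dfs.root"]),
        "mongo": make_loader("mongo", ["mongo.db1"]),
    }
    print(query_schemata(["cp.default", "dfs.tmp"], plugins, limit=2))
    print("plugins contacted:", contacted)
```

The design choice worth noting is that the limit is enforced by the consumer of a lazy iterator, so a query satisfied entirely from cached entries never opens a plugin connection, and a larger limit contacts plugins one at a time only until enough rows are found.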
