Before we discuss the next release, I would like to point out that an Apache project should not be directly linked to a commercial company; otherwise, the community's motivation to contribute will suffer.
Thanks.

> On Feb 6, 2022, at 21:29, Charles Givre <[email protected]> wrote:
>
> Hello all,
> Firstly, I wanted to thank everyone for all the work that has gone into Drill
> 1.20 as well as the ongoing discussion around Drill 2.0. I wanted to start
> a discussion around topic for Drill 1.21 and that is INFO_SCHEMA
> improvements. As my company wades further and further into Drill, it has
> become apparent that the INFO_SCHEMA could use some attention. James Turton
> submitted a PR which was merged into Drill 1.20, but in so doing he uncovered
> an entire Pandora's box of other issues which might be worth addressing. In
> a nutshell, the issues with the INFO_SCHEMA are all performance related: it
> can be very slow and also can consume significant resources when executing
> even basic queries.
>
> My understanding of how the info schema (IS) works is that when a user
> executes a query, Drill will attempt to instantiate every enabled storage
> plugin to discover schemata and other information. As you might imagine, this
> can be costly.
>
> So, (and again, this is only meant as a conversation starter), I was thinking
> there are some general ideas as to how we might improve the IS:
> 1. Implement a limit pushdown: As far as I can tell, there is no limit
> pushdown in the IS and this could be a relatively quick win for improving IS
> query performance.
> 2. Caching: I understand that caching is tricky, but perhaps we could add
> some sort of schema caching for IS queries, or make better use of the Drill
> metastore to reduce the number of connections during IS queries. Perhaps in
> combination with the metastore, we could implement some sort of "metastore
> first" plan, whereby Drill first hits the metastore for query results and if
> the limit is reached, we're done. If not, query the storage plugins...
> 3. Parallelization: It did not appear to me that Drill parallelizes IS
> queries. We may be able to add some parallelization which would improve
> overall speed, but not necessarily reduce overall compute cost
> 4. Convert to EVF2: Not sure that there's a performance benefit here, but
> at least we could get rid of cruft
> 5. Reduce SeDe: I imagine there was a good reason for doing this, but the
> IS seems to obtain a POJO from the storage plugin then write these results to
> old-school Drill vectors. I'm sure there was a reason it was done this way,
> (or maybe not) but I have to wonder if there is a more efficient way of
> obtaining the information from the storage plugin, ideally w/o all the object
> creation.
>
> These are just some thoughts, and I'm curious as to what the community thinks
> about this. Thanks everyone!
> -- C
