Another somewhat crazy idea I had was to see if we can have sys tables as materialized views with some refresh cadence in the metadata store. The kind of queries that the UI is generating can be easily and efficiently handled by most of the commonly used relational databases.
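To make that concrete, a minimal sketch of what the metadata store could maintain, in PostgreSQL-style syntax (the druid_segments table and column names below are the default metadata-store layout, and the view and refresh details are purely illustrative, not a worked design):

    -- Illustrative only: a view that sys-table queries could read from,
    -- refreshed on a cadence by a scheduled job rather than computed per query.
    CREATE MATERIALIZED VIEW sys_segments_mv AS
      SELECT id, dataSource, "start", "end", version, used, created_date
      FROM druid_segments
      WHERE used = true;

    REFRESH MATERIALIZED VIEW sys_segments_mv;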
Another crazier idea was to introduce a mechanism where we can have sys tables backed by Druid datasources. Every time a new segment is added or a new task is spawned, it would generate a Kafka or Kinesis event that can be used to update the Druid datasource backing the corresponding sys table. The only drawback here is that such datasources won't have effective rollups, since segment id / task id will generally be unique. But combined with a reasonable TTL controlling the amount of data in these datasources, it may just be good enough.

On Wed, May 19, 2021 at 10:04 PM Gian Merlino <g...@apache.org> wrote:

> Hey Frank,
>
> These notes are really interesting. Thanks for writing them down.
>
> I agree that the three things you laid out are all important. With regard to SQL clauses from the web console, I did notice one recent change went in that changed the SQL clauses to only query sys.segments for columns that are actually visible, part of https://github.com/apache/druid/pull/10909. That isn't very useful right now, since there isn't projection pushdown. But if we add it, this will limit JSON serialization to only the fields that are actually requested, which will be useful if not all of them are requested by default. Switching to use OFFSET / LIMIT for tasks too would also be good (or even just LIMIT would be a good start).
>
> Out of curiosity, how many tasks do you typically have in your sys.tasks table?
>
> Side note: I'm not sure if you looked into druid.indexer.storage.recentlyFinishedThreshold, but that might be useful as a workaround for you until some of these changes are made. You can set it lower and it will reduce the number of complete tasks that the APIs return.
>
> On Tue, May 18, 2021 at 8:13 AM Chen Frank <frank.chen...@outlook.com> wrote:
>
> > Hi Jason
> >
> > I have tracked this problem for quite a while. Since you are interested in it, I would like to share something I know with you so that you could take these into consideration.
> >
> > In 0.19.0, there was a PR #9883 improving the performance of the segments query by eliminating the JSON serialization. But PR #10752, merged in 0.21.0, brings back JSON serialization. I do not know whether this change reverts the performance gain of the previous PR.
> >
> > For tasks, the performance is much worse. There are some problems reported about the task UI, e.g. #11042 and #11140, but I do not see any feedback on the segment UI. One reason is that the web console fetches ALL task records from the broker and does pagination at the client side instead of using a LIMIT clause in SQL to do pagination at the server side. Another reason is that the broker fetches ALL tasks via REST API from the overlord, which loads records from metadata storage directly and deserializes data from the `pay_load` field.
> >
> > For segments, the two problems above do not exist because
> >
> > 1. a LIMIT clause is used in SQL queries
> >
> > 2. the segments query returns a snapshot of in-memory segment data, which means there is no query to the metadata database and no JSON deserialization of the `pay_load` field.
> >
> > In 0.20, OFFSET is supported for SQL queries. I think this could also be added to the queries from the web console, which would bring some performance gain to some extent.
> >
> > IMO, to improve the performance, we might need to make changes to
> >
> > 1. the SQL layer you mentioned above
> >
> > 2. the SQL clauses from the web console
> >
> > 3. the task REST API, to support search conditions and ordering to narrow down the search range on the metadata table
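For concreteness, server-side pagination of the task list would mean the console issuing something along these lines instead of fetching every task (an illustrative sketch only; the exact columns, ordering, and page size the console would use are assumptions):

    -- Hypothetical paginated sys.tasks query; page size and ordering column
    -- are arbitrary choices for illustration.
    SELECT "task_id", "type", "datasource", "created_time", "status"
    FROM sys.tasks
    ORDER BY "created_time" DESC
    LIMIT 50
    OFFSET 100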
> > Thanks.
> >
> > From: Jason Koch <jk...@netflix.com.INVALID>
> > Date: Saturday, May 15, 2021, 3:51 AM
> > To: dev@druid.apache.org <dev@druid.apache.org>
> > Subject: Re: Push-down of operations for SystemSchema tables
> >
> > @Julian - thank you for review & confirming.
> >
> > Hi Clint
> >
> > Thank you, I appreciate the response. I have responded inline with some questions, and I've also restated things in my own words as a confirmation that I understand ...
> >
> > > In the mid term, I think that some of us have been thinking that moving system tables into the Druid native query engine is the way to go, and have been working on resolving a number of hurdles that are required to make this happen. One of the main motivators to do this is so that we have just the Druid query path in the planner in the Calcite layer, and deprecating and eventually dropping the "bindable" path completely, described in https://github.com/apache/druid/issues/9896. System tables would be pushed into Druid Datasource implementations, and queries would be handled in the native engine. Gian has even made a prototype of what this might look like, https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native since much of the ground work is now in place, though it takes a hard-line approach of completely removing bindable instead of hiding it behind a flag, and doesn't implement all of the system tables yet, at least last time I looked at it.
> >
> > Looking over the changes it seems that:
> > - a new VirtualDataSource is introduced, which the Druid non-SQL processing engine can process, that can wrap an Iterable. This exposes lazy segment & iterable using InlineDataSource.
> > - the SegmentsTable has been converted from a ScannableTable to a DruidTable, and a ScannableTableIterator is introduced to generate an iterable containing the rows; the new VirtualDataSource can be used to access the rows of this table.
> > - finally, the Bindable convention is discarded from DruidPlanner and Rules.
> >
> > > I think there are a couple of remaining parts to resolve that would make this feasible. The first is native scan queries need support for ordering by arbitrary columns, instead of just time, so that we can retain capabilities of the existing system tables.
> >
> > It seems you want to use the native queries to support ordering; do you mean here the underlying SegmentsTable, or something in the Druid engine? Currently, the SegmentsTable etc. relies on, as you say, the bindable convention to provide sort. If it was a DruidTable then it seems that sorting gets pushed into PartialDruidQuery->DruidQuery, which conceptually is able to do a sort, but as described in [1] [2] the ordering is not supported by the underlying Druid engine [3].
> >
> > This would mean that an order by, sort, limit query would not be supported on any of the migrated sys.* tables until Druid has a way to perform the sort on a ScanQuery.
> >
> > [1] https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> > [2] https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> > [3] https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
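For concreteness, the problem case is any ordering that is not by __time, which the sys tables do not have in the first place. A query like the following is served by the bindable path today, but a native scan query could not produce the ordering (the columns chosen here are just an example):

    -- Ordering by a non-time column; the native scan engine only supports
    -- ordering by __time, and sys.segments has no such column.
    SELECT "segment_id", "datasource", "size", "num_rows"
    FROM sys.segments
    ORDER BY "size" DESC
    LIMIT 25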
> > > This isn't actually a blocker for adding native system table queries, but rather a blocker for replacing the bindable convention by default so that there isn't a loss (or rather trade) of functionality. Additionally, I think there are maybe some matters regarding authorization of system tables when handled by the native engine that will need to be resolved, but this can be done while adding the native implementations.
> >
> > It looks like the port of the tables from a classic ScannableTable to a DruidTable itself is straightforward. However, it seems this PR doesn't bring them across from the SQL domain to be available in any native queries. I'm not sure if this is expected, an interim step, or if I have misunderstood the goal.
> >
> > > I think there are various ideas and experiments underway of how to do sorting on scan queries at normal Druid datasource scale, which is sort of a big project, but in the short term we might be able to do something less ambitious that works well enough at system tables scale to allow this plan to fully proceed.
> >
> > One possible way, that I think leads in the correct direction:
> > 1) We have an existing rule for LogicalTable with DruidTable to DruidQueryRel which can eventually construct a DruidQuery.
> > 2) The VirtualDataSource, created during SQL parsing, takes an already-constructed Iterable; so we need to have already performed the filter/sort before creating the VirtualDataSource (and DruidQuery). This means the push-down filter logic has to happen during sql/ stage setup, before handoff to the processing/ engine.
> > 3) Perhaps a new VirtualDruidTable subclassing DruidTable, with a RelOptRule that can identify a LogicalXxx above a VirtualDruidTable and push it down? Then our SegmentsTable and friends can expose the correct Iterable. This should allow us to solve the perf concerns, and would allow us to present a correctly constructed VirtualDataSource. Sort from SQL _should_ be supported (I think) as the planner can push the sort etc. down to these nodes directly.
> >
> > In this approach, the majority of the work has to happen prior to the Druid engine, in sql/, before reaching Druid, and so Druid core doesn't actually need to know anything about these changes.
> >
> > On the other hand, whilst it keeps the pathway open, I'm not sure this does any of the actual work to make the sys.* tables available as native tables. If we are to try to make these into truly native tables, without a native sort, and remove their implementation from sql/, the DruidQuery in the planner would need to be configured to pass the ScanQuery sort to the processing engine _but only for sys.* tables_, and then the processing engine would need to know how to find these tables. (I haven't explored this.) As you mention, implementing native sort across multiple data sources seems like a more ambitious piece of work.
> > As another idea, we could consider creating a bridge Bindable/EnumerableToDruid rule that would allow Druid to embed these tables, move them out of sql/ into processing/, exposed as Iterable/Enumerable, and make them available in queries if that is a goal. I'm not really sure that adds anything to the overall goals though.
> >
> > > Does this approach make sense? I don't believe Gian is actively working on this at the moment, so I think if you're interested in moving along this approach and want to start laying the groundwork I'm happy to provide guidance and help out.
> >
> > I am interested. For my current work, I do want to keep the focus on the sys.* performance work. If there's a way to do it and lay the groundwork, or even get all the work done, then I am 100% for that. Looking at what you want to do to convert these sys.* tables to native tables, if we have a viable solution or are comfortable with my suggestions above, I'd be happy to build it out.
> >
> > Thanks
> > Jason
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org