Regarding the SCOPE paper you reference. That was on my mind too (I went to the talk at SIGMOD). A materialized view is created only if the same query is used *textually identically* in different parts of the ETL process, so it is mainly for optimizing batch jobs that are largely the same night after night. Lattices are a better approach for optimizing interactive BI work-loads.
Julian > On Aug 6, 2018, at 4:57 PM, Jesus Camacho Rodriguez > <[email protected]> wrote: > > You can find an overview of the work that has been done in Hive for > materialized view integration in the following link: > https://cwiki.apache.org/confluence/display/Hive/Materialized+views > <https://cwiki.apache.org/confluence/display/Hive/Materialized+views> > Materialized views can be stored in external tables such as Druid-backed > tables too. Druid rules that in Calcite are used to push computation > to Druid from Hive. > > The rewriting algorithm itself is in Calcite. The algorithm can take advantage > of constraints (PK-FK relationship between tables) to produce additional > correct rewritings, can execute rollups, etc. However, it does not assume any > specific schema layout, which may make it useful for multiple ETL workloads. > http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information > > <http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information> > The most recent addition is the support for partitioned materialized views, > including the extension in the cost model to take into account partition > pruning > during the planning phase. > > Incremental maintenance is supported. Most of that code lives in Hive, but it > relies > on the rewriting algorithm too. It only works for materialized views that use > Hive > transactional tables, either full ACID or insert-only. Basically Hive exposes > explicitly > the data contained in the materialization via filter condition, e.g., mv1 > contains data > for transactions (x, y, z), then let the rewriting algorithm trigger a > partial rewriting > which reads new contents from the sources tables and processed contents from > mv1. > Finally, an additional step transforms the rewritten expression into an > INSERT or > MERGE statement depending on the materialized view expression (MERGE for > materialized views containing aggregations). Since not all tables in Hive > support > UPDATE needed for MERGE, we were thinking about allowing some target > materialized > views with definitions that include aggregates to use INSERT and then force > the rollup > at runtime, e.g., for Druid. > bq. Maybe it depends on the aggregation functions that are used? > The result of some aggregate functions cannot be (always) incrementally > maintained in > the presence of UPDATE/DELETE operations on source tables, e.g., min and max, > though > some rewriting to minimize full rebuilds can be used if count is added as an > additional > column to the materialized view. Incremental maintenance in presence of > UPDATE/DELETE > operations in source tables is not supported in Hive yet, hence this is not > implemented. > > > I would like to think that of the problems described below, we are getting to > the > 'more interesting stuff' in the Hive project, though there is some > consolidation needed for > existing work too. That is why we are also interested in any effort related > to materializations > recommendation. I believe the most powerful abstraction to use would be > RelNode, which > can be useful for any system representing its queries internally using that > representation, > instead of relying on SQL nodes which are more closely tight to the parser. > > Concerning the ´feedback loop´, this recent paper by MSFT describes a system > that does > something similar to what James was describing (for SCOPE): > https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf > > <https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf> > > -Jesús > > > > On 8/6/18, 3:32 PM, "Julian Hyde" <[email protected] > <mailto:[email protected]>> wrote: > > It’s hard to automatically recommend a set of MVs from past queries. The > design space is just too large. But if you are designing MVs for interactive > BI, you can use the “lattice” model. This works because many queries will be > filter-join-aggregate queries on a star schema (i.e. a central fact table and > dimension tables joined over many-to-one relationships). (Or perhaps a join > between two or more such queries.) > > Do the queries you are trying to optimize have that pattern? > > If so, you might start by creating a lattice for each such star schema. > Then the lattice can suggest MVs that are summary tables. > > (Lattice suggester is one step more meta - it recommends lattices - but > given where you are, I would suggest hand-writing one or two lattices.) > > Calcite is a framework, and this unfortunately means that you have to > write Java code to use these features. It might be easier if you use the new > “server” module, which supports CREATE MATERIALIZED VIEW as a DDL statement. > Then you can create some demos for your colleagues that are wholly or mostly > SQL. > > The simplest way to populate a materialized view is the CREATE > MATERIALIZED VIEW statement. It basically does the same as CREATE TABLE AS > SELECT (executes a query, stores the results in a table) but it leaves behind > the metadata about where that data came from. > > Materialized views can in principle be maintained incrementally, but how > you do it depends upon what changes are allowed (append only? Replace rows > and write the old rows to an audit table?). We’ve not done a lot of work on > it. I believe the Hive folks have given this more thought than I have. > > Julian > > >> On Aug 3, 2018, at 11:11 PM, James Taylor <[email protected]> wrote: >> >> Both the Lattice Suggestor and Quark sound like what I need for an >> automated solution, but I have some more basic follow up questions first. >> Here's our basic use case (very similar to Zheng Shao's, I believe): >> - Our company has stood up Presto for data analysts >> - Nightly ETL jobs populate Hive tables with lots of data >> - Analysts run adhoc queries over data using Presto >> - The top CPU using queries are pretty complex (2-3 pages of complex SQL, >> lots of joins and aggregation) >> >> There are some basic/obvious stuff that can be done manually first: >> - Provide better visibility into which queries are expensive >> - Ask query owners to produce their own materialized views and manually >> change their queries to use them (I believe there's some amount of this >> already) >> >> Then there's kind of a middle ground: >> - Ask query owners to identify what they think are the top few materialized >> views to build >> - Manually build these materialized views in the daily ETL job. >> - Use Calcite to rewrite the query to use the materialized views. Can >> Calcite do this and would it be a problem if the queries are Presto >> queries? I'd need to make sure I provided Calcite with the cost information >> it needs, right? >> - Dark launch to test that the rewritten query returns the same results as >> the original query (and measure the perf improvement) >> >> But the more interesting stuff is: >> - Automatically identifying the materialized views that should be built. >> Sounds like both the Lattice Suggestor and Quark are potentially a good >> fit. I'm not clear on what is output by the Suggestor. Would it spit out a >> CREATE VIEW statement (or could what it outputs produce that)? How does the >> Suggestor compare with Quark? >> - Automatically build the materialized views. Would the Lattice framework >> or Quark help me with that? Would it be possible to incrementally build the >> materialized views or would it be necessary to build the materialized views >> from the beginning of time again and again (clearly not feasible given the >> size of the tables)? Maybe it depends on the aggregation functions that are >> used? >> - And the nirvana is a kind of feedback loop - based on the top N expensive >> queries, identify and build the materialized views, use them transparently >> during querying, and then retire them when they're infrequently used. >> >> Would it be a better choice to build the materialized views as Druid >> tables? That'd require a Druid connector to Presto, though. This reminds me >> of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731) >> but for Presto instead of Hive. Do you think any of that would transfer >> over in some way? >> >> WDYT? Huge amount of work? Any advice is much appreciated. >> >> Thanks, >> James >> >> On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <[email protected] >> <mailto:[email protected]><mailto:[email protected] >> <mailto:[email protected]>>> wrote: >> >>> PS >>> >>> +1 for Babel. >>> >>> If you are analyzing a set of queries, it is very likely that these >>> queries were written to be executed against another database. Babel aims to >>> take such queries and convert them into Calcite relational algebra. The >>> process might occasionally be lossy, if Calcite's algebra does not support >>> a feature, but the algebra is still useful. >>> >>>> On Jul 26, 2018, at 11:25 AM, Julian Hyde <[email protected] >>>> <mailto:[email protected]>> wrote: >>>> >>>> There are many possible analyzers, but Lattice Suggester is one that I >>> am working on and is relatively mature. It looks at lots of queries and >>> builds lattices (star schemas with measures) from those queries. It finds >>> commonality by “growing” lattices - adding measures, adding derived >>> expressions, adding many-to-one joins. Lattice Suggester takes a set of SQL >>> query strings, then it parses them (to SqlNode), validates, and converts to >>> relational algebra (RelNode). Then it looks for patterns in the relational >>> algebra. Working at the algebra level as opposed to the SQL parse tree is a >>> net benefit, but some things (e.g. figuring out the original column alias >>> for an expression) are a bit more difficult. >>>> >>>> Lattice suggester is under development in my >>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester >>> <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester> < >>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester >>> <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester><https://github.com/julianhyde/calcite/tree/1870-lattice-suggester >>> <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester>>> >>> branch. The tests pass, and I hope to have it merged into master in the >>> next month or two. >>>> >>>> I agree with Devjyoti that it’s difficult to gather together all >>> possible analyses in one tool. Lattice Suggester is a good base for >>> analyses that model queries as a filter/project/aggregate of a pre-joined >>> star schema — a very common model in BI; for example, it could model which >>> combinations of columns are commonly used as filters. >>>> >>>> For analyses that are not tied to star schemas, feel free to create new >>> tools. The tools would benefit from collaborative development, and I think >>> that Calcite would be a good home for them. >>>> >>>> Julian >>>> >>>> >>>> >>>> >>>>> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <[email protected] >>>>> <mailto:[email protected]> >>> <mailto:[email protected] <mailto:[email protected]> >>> <mailto:[email protected] <mailto:[email protected]>>>> wrote: >>>>> >>>>> Hi Zheng, >>>>> >>>>> At Qubole, we are building something very similar to what you are >>> looking >>>>> for. And from experience, I can tell you that it is a lot easy to build >>> it >>>>> than what one may think. >>>>> We use Calcite parser to parse the SQL into Sqlnode and then use >>> different >>>>> tree visitors to extract query attributes like tables, filter columns, >>>>> joins, subqueries etc., >>>>> >>>>> Our approach is very similar to Uber's QueryParser project ( >>>>> https://github.com/uber/queryparser <https://github.com/uber/queryparser> >>>>> <https://github.com/uber/queryparser >>>>> <https://github.com/uber/queryparser>> <https://github.com/uber/ >>>>> <https://github.com/uber/> <https://github.com/uber/ >>>>> <https://github.com/uber/>> >>> queryparser> ), but we go deeper in our analysis of >>>>> finding queries that are semantically similar to some canonicalized >>> form. >>>>> If you intend to begin from scratch, I can give you some pointers to get >>>>> started. >>>>> >>>>> Thanks, >>>>> Devjyoti >>>>> >>>>> >>>>> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <[email protected] >>>>> <mailto:[email protected]><mailto:[email protected] >>>>> <mailto:[email protected]>> <mailto: >>> [email protected]>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> We are thinking about starting a project to analyze huge number of SQL >>>>>> queries (think millions) to identify common patterns: >>>>>> * Common sub queries >>>>>> * Common filtering conditions (columns) for a table >>>>>> * Common join keys for table pairs >>>>>> >>>>>> Are there any existing projects on that direction using Calcite? Would >>>>>> love to leverage instead of building from scratch. >>>>>> >>>>>> Zheng
