MaterializationService.THREAD_INSTANCE vs INSTANCE
Hi. I'm trying to use Calcite to optimize queries by rewriting them to use materialized views. First, I found the following doc page about materialization on the Calcite website by googling, but I couldn't find a link to it from the website itself. In any case, it does not explain how to add materialized views to the schema. http://calcite.apache.org/docs/materialized_views.html I'm trying to follow how the Cassandra adapter does it. It uses the Hook.TRIMMED hook to add the materialized views. Is that the recommended approach? I noticed that MaterializationService.instance() first tries to get an instance from a ThreadLocal field (THREAD_INSTANCE) and, if that is null, uses the static field (INSTANCE). The comments in the class say that the ThreadLocal instance is supposed to be used for testing. However, the THREAD_INSTANCE field is initialized with ThreadLocal.withInitial(MaterializationService::new), so it seems it will never be null, making the instance() method always return the THREAD_INSTANCE instance. The problem is that I'm running the statements in worker threads, and sometimes the materialized view is used and sometimes not. I think the problem is that the materialization is not available when running in a different thread. Is that a bug? Any tips on how to add materialized views to an AbstractSchema? Regards, Luis Fernando
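For reference, the lookup described above has the following shape. This is a minimal sketch, not Calcite's actual source, illustrating why the null fallback can never be taken and why each worker thread ends up with its own registry:

import java.util.function.Supplier;

public class MaterializationServiceSketch {
  // withInitial lazily creates a per-thread instance on get(),
  // so the null check in instance() below can never fail.
  private static final ThreadLocal<MaterializationServiceSketch> THREAD_INSTANCE =
      ThreadLocal.withInitial(MaterializationServiceSketch::new);

  private static final MaterializationServiceSketch INSTANCE =
      new MaterializationServiceSketch();

  public static MaterializationServiceSketch instance() {
    MaterializationServiceSketch service = THREAD_INSTANCE.get();
    if (service != null) {   // always true with withInitial
      return service;
    }
    return INSTANCE;         // effectively unreachable
  }
}

A materialization registered on one thread is then invisible to a query planned on another worker thread, which would explain the intermittent behavior described above.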
Benchmarking Calcite - starting the conversation on the targets and design of the benchmark
I'd like to share the following link that I came across some time ago about a product that uses Calcite to optimize Spark queries. https://www.datascience.com/blog/grunion-data-science-tools-query-optimizer-apache-spark
[jira] [Created] (CALCITE-2081) NPE in join of 2 subqueries using window functions
Luis Fernando Kauer created CALCITE-2081: Summary: NPE in join of 2 subqueries using window functions Key: CALCITE-2081 URL: https://issues.apache.org/jira/browse/CALCITE-2081 Project: Calcite Issue Type: Bug Reporter: Luis Fernando Kauer Assignee: Julian Hyde

NPE when joining 2 subqueries that use window functions. Testing with a new test case in JdbcTest:

{code}
/** Tests for a join of subqueries using window functions. */
@Test public void testJoinWithWinAgg() {
  final String sql = "select a.*, b.r from\n"
      + "(select \"deptno\", first_value(\"empid\") over \n"
      + "(partition by \"deptno\" order by \"commission\") as r\n"
      + "from \"hr\".\"emps\") a\n"
      + "left join\n"
      + "(select \"deptno\", last_value(\"empid\") over \n"
      + "(partition by \"deptno\" order by \"commission\") as r\n"
      + "from \"hr\".\"emps\") b\n"
      + "on a.\"deptno\" = b.\"deptno\"";
  CalciteAssert.hr()
      .query(sql)
      .runs();
}
{code}

Debugging this, I found out that the NPE occurs at Expressions.declare (line 2937) in initializer.getType() because initializer is null. The statement is

{code}
int prevStart;
{code}

This statement does not have an initializer, but the method "declare" is trying to get the initializer's type, generating an NPE. This happens when joining 2 subqueries that use window functions because they end up using the same variables. In BlockBuilder.append (line 124) it checks whether the variable already exists, and the problem starts there.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
Re: BindableTableScan is losing aliased named
Please insert the following test in ScannableTableTest and run:

/** ProjectableFilterableTable project push down using alias. */
@Test public void testPFPushDownWithAlias() throws Exception {
  final StringBuilder buf = new StringBuilder();
  final Table table = new BeatlesProjectableFilterableTable(buf, true);
  CalciteAssert.that()
      .with(newSchema("s", "beatles", table))
      .query("select \"i\" as theKey from \"s\".\"beatles\"")
      .explainContains("EnumerableInterpreter\n"
          + " BindableTableScan(table=[[s, beatles]], projects=[[0]])\n")
      .returnsUnordered("THEKEY=4", "THEKEY=4", "THEKEY=6", "THEKEY=5");
}

It passes for me. You'll need the updated version of this test class, because it was recently refactored. It's usually easier for other people to test the problem if you create runnable test cases that show it.

On Friday, November 17, 2017, 14:29:09 BRST, Enrico Olivelli <eolive...@gmail.com> wrote:

In the RowType I have 'k1' and not "thekey"
Enrico

2017-11-17 17:20 GMT+01:00 Luis Fernando Kauer <lfka...@yahoo.com.br.invalid>:
> Did you execute the query?
> ProjectRemoveRule removes the Project when it is "trivial". Since the only used project was pushed to BindableTableScan, the Project would only set the alias, but that can be done in the row type.
> The result is correct because the RowType is preserved with the alias. It just does not show in the plan.
> However, it seems that repeating a column with different aliases generates an error:
> SELECT k1 theKey, k1 theKey2 FROM tblspace1.tsql where k1 ='mykey2'
>
> Regards,
> Luis Fernando
>
> On Friday, November 17, 2017, 12:27:49 BRST, Enrico Olivelli <eolive...@gmail.com> wrote:
>
> Hi,
> I have a ProjectableFilterableTable; it seems to me that the BindableTableScan RelNode does not keep track of column name aliases in the original projection.
>
> Example:
>
> Query: SELECT k1 theKey FROM tblspace1.tsql where k1 ='mykey2'
>
> -- Logical Plan
> LogicalProject(THEKEY=[$0])
>   LogicalFilter(condition=[=($0, 'mykey2')])
>     EnumerableTableScan(table=[[tblspace1, tsql]])
>
> -- Best Plan
> EnumerableInterpreter
>   BindableTableScan(table=[[tblspace1, tsql]], filters=[[=($0, 'mykey2')]], projects=[[0]])
>
> Is this correct?
> IMHO it would be better to keep something like an EnumerableProjection which renames the fields.
>
> Am I missing something?
> Thanks
> Enrico
Re: BindableTableScan is losing aliased named
Did you execute the query? ProjectRemoveRule removes the Project when it is "trivial". Since the only used project was pushed to BindableTableScan, the Project would only set the alias, but that can be done in the row type. The result is correct because the RowType is preserved with the alias. It just does not show in the plan. However, it seems that repeating a column with different aliases generates an error: SELECT k1 theKey, k1 theKey2 FROM tblspace1.tsql where k1 ='mykey2'

Regards,
Luis Fernando

On Friday, November 17, 2017, 12:27:49 BRST, Enrico Olivelli wrote:

Hi, I have a ProjectableFilterableTable; it seems to me that the BindableTableScan RelNode does not keep track of column name aliases in the original projection.

Example:

Query: SELECT k1 theKey FROM tblspace1.tsql where k1 ='mykey2'

-- Logical Plan
LogicalProject(THEKEY=[$0])
  LogicalFilter(condition=[=($0, 'mykey2')])
    EnumerableTableScan(table=[[tblspace1, tsql]])

-- Best Plan
EnumerableInterpreter
  BindableTableScan(table=[[tblspace1, tsql]], filters=[[=($0, 'mykey2')]], projects=[[0]])

Is this correct? IMHO it would be better to keep something like an EnumerableProjection which renames the fields.

Am I missing something?
Thanks
Enrico
Re: Project + Sort on single and on multiple columns
There is something I don't get about Sort in Calcite. Actually, it seems to happen already when parsing the SQL. Sort does not limit the projects, but it requires that the projects used for ordering are included in its input. How can a Sort be the top node when there are projects used for ordering that are not in the select list? I thought there should always be a top Project node selecting the projects to return when it needs to restrict or change from the previous node. The problem is that the lack of a top Project node restricting the projects will make adapters return more columns than are actually necessary.

Example 1:
SELECT empno FROM scott.emp ORDER BY ename

Logical Plan:
LogicalSort(sort0=[$1], dir0=[ASC])
  LogicalProject(EMPNO=[$0], ENAME=[$1])
    JdbcTableScan(table=[[SCOTT, EMP]])

Physical Plan:
JdbcToEnumerableConverter
  JdbcProject(ENAME=[$1], EMPNO=[$0])
    JdbcSort(sort0=[$0], dir0=[ASC])
      JdbcTableScan(table=[[SCOTT, EMP]])

The generated SQL to query the database is:
SELECT ENAME, EMPNO FROM (SELECT * FROM SCOTT.EMP ORDER BY EMPNO NULLS LAST) AS t

---

Example 2:
SELECT ename FROM scott.emp ORDER BY empno limit 10

Logical Plan:
LogicalSort(sort0=[$1], dir0=[ASC], fetch=[10])
  LogicalProject(EMPNO=[$0], ENAME=[$1])
    JdbcTableScan(table=[[SCOTT, EMP]])

Physical Plan:
JdbcToEnumerableConverter
  JdbcProject(EMPNO=[$0], ENAME=[$1])
    JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
      JdbcTableScan(table=[[SCOTT, EMP]])

The generated SQL to query the database is:
SELECT EMPNO, ENAME FROM (SELECT * FROM SCOTT.EMP ORDER BY ENAME NULLS LAST FETCH NEXT 10 ROWS ONLY) AS t0

I'm facing 3 main problems:
1. There's no Project node that limits the projects above Sort, even in the Logical Plan.
2. RelToSqlConverter does not work right with Sort.
3. Sometimes JdbcProject is above JdbcSort and other times below.

Can anybody help me figure this out?

On Saturday, November 11, 2017, 09:35:40 BRST, Christian Tzolov <ctzo...@pivotal.io> wrote:

@Julian, this is exactly what I would expect as logical behavior. In fact it (consistently) behaves like this (e.g. Sort is after Project) when the Sort is on a single column! But for some reason, if the Sort is performed on two columns (the same ones used in the Project) then the Sort is performed before the Project! This would be expected if the Sort columns were not contained in the Project. So I wonder, could it be that somewhere during the Sort optimization this column check gets confused?

On 11 November 2017 at 03:05, Julian Hyde <jh...@apache.org> wrote:
> While the cost of Project does not depend heavily on the number of input columns, the cost of Sort (or at least a typical Sort algorithm such as external merge sort) does depend on the number of columns (or more precisely on the average row size in bytes). So, if the Project reduces the number of columns (as most Projects do) then the Sort will have lower cost if performed after the Project, because it is handling fewer bytes.
>
>> On Nov 10, 2017, at 10:32 AM, Luis Fernando Kauer <lfka...@yahoo.com.br.INVALID> wrote:
>>
>> I'm trying to fix https://issues.apache.org/jira/browse/CALCITE-1906 and I'm facing a similar problem.
>> After managing to make JdbcSort work, sometimes the JdbcProject is above JdbcSort and the generated SQL is wrong, because RelToSqlConverter uses the SqlImplementor.Clause enum to decide when to create subqueries, but since ORDER_BY is after SELECT, once it gets to JdbcProject it can't use the same query because it already used ORDER_BY.
>> The rule responsible for this is SortProjectTransposeRule. The opposite rule is ProjectSortTransposeRule, but that one only matches if the sort node is exactly Sort.class, so it ends up not matching.
>> Is pushing the Project above Sort usually a good final plan, or is it done to allow other rules to match? If it is not, maybe we should solve this in the core project.
>>
>> On Friday, November 10, 2017, 15:57:34 BRST, Michael Mior <mm...@uwaterloo.ca> wrote:
>>
>> Since the cost of the project doesn't depend on the number of columns being projected or the size of the input, putting the project before or after the sort will result in the same estimated cost. One approach would be to scale the cost of the projection based on the fraction of columns projected.
>>
>> --
>> Michael Mior
>> mm...@apache.org
>>
>> 2017-11-10 12:42 GMT-05:00 Christian Tzolov <ctzo...@pivotal.io>:
>>
>>> I've observed in my
>>> no-sql adapter
>>> implementation that for q
>>>
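One way to test whether SortProjectTransposeRule is responsible for the Project-above-Sort shape is to remove it from the planner for a run. A hedged sketch; `planner` stands for a RelOptPlanner obtained as in the other threads here (e.g. via cluster.getPlanner()):

// Hypothetical check: remove the transpose rule before findBestExp() and
// see whether JdbcProject stops being pushed above JdbcSort.
planner.removeRule(SortProjectTransposeRule.INSTANCE);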
Re: Aliases for schemas
Hi, FrameworkConfig has getDefaultSchema(); from the javadoc: "Returns the default schema that should be checked before looking at the root schema. Returns null to only consult the root schema." So, if you are using Frameworks.newConfigBuilder(), you can set it using ".defaultSchema(rootSchema.getSubSchema(\"one\"))", for example.

On Tuesday, November 14, 2017, 10:59:33 BRST, Enrico Olivelli wrote:

Hi, (I am playing with the planner, you know) I have several "schemas" in my system, like "one", "two". So my root schema is built this way:

final SchemaPlus rootSchema = Frameworks.createRootSchema(true);
for (String tableSpace : listTableSpaces()) {
  SchemaPlus schema = rootSchema.add(tableSpace, new AbstractSchema());
}

Queries like

SELECT * from one.mytable
SELECT * from two.mytable

get parsed and validated. But I would like to set a "default" schema to be used when the query does not explicitly refer to a schema, like default-schema='one'

SELECT * from mytable

which I would expect to be parsed as

SELECT * from one.mytable

Is it possible? Bonus question, separate from the former: it seems that the parser complains if I try to use a schema with the name "default", using MYSQL5_5 compatibility for instance. Is there any tweak to make it work? Calcite is really great! Enrico
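A minimal sketch of that wiring, assuming the root schema is built as in the question (listTableSpaces() from the question is replaced by literal names):

import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class DefaultSchemaSketch {
  public static Planner plannerWithDefaultSchema() {
    // Register the sub-schemas under the root, as in the question.
    final SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    rootSchema.add("one", new AbstractSchema());
    rootSchema.add("two", new AbstractSchema());

    // defaultSchema is consulted before the root schema, so
    // "SELECT * FROM mytable" resolves as one.mytable.
    final FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema.getSubSchema("one"))
        .build();
    return Frameworks.getPlanner(config);
  }
}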
Re: Project + Sort on single and on multiple columns
I'm trying to fix https://issues.apache.org/jira/browse/CALCITE-1906 and I'm facing a similar problem. After managing to make JdbcSort work, sometimes the JdbcProject is above JdbcSort and the generated SQL is wrong, because RelToSqlConverter uses the SqlImplementor.Clause enum to decide when to create subqueries, but since ORDER_BY is after SELECT, once it gets to JdbcProject it can't use the same query because it already used ORDER_BY. The rule responsible for this is SortProjectTransposeRule. The opposite rule is ProjectSortTransposeRule, but that one only matches if the sort node is exactly Sort.class, so it ends up not matching. Is pushing the Project above Sort usually a good final plan, or is it done to allow other rules to match? If it is not, maybe we should solve this in the core project.

On Friday, November 10, 2017, 15:57:34 BRST, Michael Mior wrote:

Since the cost of the project doesn't depend on the number of columns being projected or the size of the input, putting the project before or after the sort will result in the same estimated cost. One approach would be to scale the cost of the projection based on the fraction of columns projected.

--
Michael Mior
mm...@apache.org

2017-11-10 12:42 GMT-05:00 Christian Tzolov <ctzo...@pivotal.io>:

> I've observed in my no-sql adapter implementation that for queries with Project + Sort by ONE column the Project is pushed (as expected) before the Sort, but for Sort on MULTIPLE columns the Sort is before the Project.
> For example, for a query with one sort column:
>
> SELECT yearPublished FROM BookMaster ORDER BY yearPublished ASC
>
> the plan looks as expected (project before the sort):
>
> PLAN=GeodeToEnumerableConverterRel
>   *GeodeSortRel*(sort0=[$0], dir0=[ASC])
>     GeodeProjectRel(yearPublished=[$2])
>       GeodeTableScanRel(table=[[TEST, BookMaster]])
>
> But for sort with two columns:
>
> SELECT yearPublished, itemNumber from BookMaster ORDER BY yearPublished ASC, itemNumber ASC
>
> the plan is:
>
> PLAN=GeodeToEnumerableConverterRel
>   GeodeProjectRel(yearPublished=[$2], itemNumber=[$0])
>     *GeodeSortRel*(sort0=[$2], sort1=[$0], dir0=[ASC], dir1=[ASC])
>       GeodeTableScanRel(table=[[TEST, BookMaster]])
>
> I'm not sure I can explain why in the second case the Sort appears before the Project. Here are my cost functions:
>
> * GeodeSortRel:
> https://github.com/tzolov/calcite/blob/geode-1.3/geode/src/main/java/org/apache/calcite/adapter/geode/rel/GeodeSortRel.java#L51
>
> * GeodeProjectRel:
> https://github.com/tzolov/calcite/blob/4a631d9055340f64f5e644454551f964ea08f9de/geode/src/main/java/org/apache/calcite/adapter/geode/rel/GeodeProjectRel.java#L52
>
> Cheers,
> Christian
Re: Recycling Planner objects
Sorry, my knowledge hasn't reached that far. Currently the tutorial and example code are mostly about creating new adapters. I think we need a tutorial, documentation and example code for working directly with the planner, because it is a lot harder. Regards, Luis Fernando

On Thursday, November 9, 2017, 06:52:26 BRST, Enrico Olivelli wrote:

I am going to use the Calcite planner inside my DBMS. I would like to understand which objects can be reused and which objects are treated in a dynamic fashion. Planner object: is this cacheable? I see it is not stateless. Is it better to always create a new instance, or can it be recycled? Table statistics: are they evaluated every time a query is planned, at least the table size? Schemas: in a live system the schema changes, but as far as I have understood the Schema is part of the framework config. Which is the best strategy? To drop the current framework config and all cached objects in case of a DDL command? Execution plans: are RelNodes cacheable? I see they have strong references to the planner that built the object. I am looking at a cache like QUERY -> PhysicalPlan. Regards Enrico
Re: Volcano planner
If you intend to run a query then you should follow the tutorial and try to change the CSV adapter. You can add the table to the schema at runtime using something like:

Class.forName("org.apache.calcite.jdbc.Driver");
Properties info = new Properties();
info.setProperty("lex", "MYSQL_ANSI");
final Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
CalciteConnection conn = connection.unwrap(CalciteConnection.class);
SchemaPlus root = conn.getRootSchema();
root.add("MYTABLE", new TableImpl());
Statement statement = conn.createStatement();
ResultSet rs = statement.executeQuery("SELECT * FROM MYTABLE");

But if you only want to parse, validate and optimize the query plan, you can use something like:

Table table = new TableImpl();
final SchemaPlus rootSchema = Frameworks.createRootSchema(true);
SchemaPlus schema = rootSchema.add("x", new AbstractSchema());
schema.add("MYTABLE", table);
List<RelTraitDef> traitDefs = new ArrayList<>();
traitDefs.add(ConventionTraitDef.INSTANCE);
traitDefs.add(RelCollationTraitDef.INSTANCE);
SqlParser.Config parserConfig = SqlParser.configBuilder(SqlParser.Config.DEFAULT)
    .setCaseSensitive(false)
    .build();
final FrameworkConfig config = Frameworks.newConfigBuilder()
    .parserConfig(parserConfig)
    .defaultSchema(schema)
    .traitDefs(traitDefs)
    // define the rules you want to apply
    .programs(Programs.ofRules(Programs.RULE_SET))
    .build();
Planner planner = Frameworks.getPlanner(config);
SqlNode n = planner.parse("SELECT * FROM MYTABLE WHERE ID < 10");
n = planner.validate(n);
RelNode root = planner.rel(n).project();
System.out.println(RelOptUtil.dumpPlan("-- Logical Plan", root,
    SqlExplainFormat.TEXT, SqlExplainLevel.DIGEST_ATTRIBUTES));
RelOptCluster cluster = root.getCluster();
final RelOptPlanner optPlanner = cluster.getPlanner();
RelTraitSet desiredTraits = cluster.traitSet().replace(EnumerableConvention.INSTANCE);
final RelNode newRoot = optPlanner.changeTraits(root, desiredTraits);
optPlanner.setRoot(newRoot);
RelNode bestExp = optPlanner.findBestExp();
System.out.println(RelOptUtil.dumpPlan("-- Best Plan", bestExp,
    SqlExplainFormat.TEXT, SqlExplainLevel.DIGEST_ATTRIBUTES));

The main problem was that you were not setting the desired trait to use EnumerableConvention. You can see that instead of implementing all the interfaces you should use the available builders and classes. Also, for implementing Table I think you should extend AbstractTable instead of implementing the Table interface, and you can use Statistics.of instead of implementing the Statistic interface if it is simple:

private static class TableImpl extends AbstractTable {
  public TableImpl() {}

  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    Builder builder = new RelDataTypeFactory.Builder(typeFactory);
    return builder.add("id", typeFactory.createSqlType(SqlTypeName.INTEGER))
        .add("name", typeFactory.createSqlType(SqlTypeName.VARCHAR))
        .build();
  }

  @Override public Statistic getStatistic() {
    return Statistics.of(15D, ImmutableList.of(),
        ImmutableList.of(RelCollations.of(0), RelCollations.of(1)));
  }
}

On Wednesday, November 8, 2017, 12:15:34 BRST, Enrico Olivelli wrote:

Hi, I am playing with the planner but I can't get it to work for a very simple query. The table is MYTABLE(id integer, name varchar); the definition is given in the code snippet. The query is "SELECT * FROM MYTABLE". The error is:

org.apache.calcite.plan.RelOptPlanner$CannotPlanException: Node [rel#7:Subset#0.NONE.[0, 1].any] could not be implemented; planner state:
Root: rel#7:Subset#0.NONE.[0, 1].any
Original rel:
LogicalProject(subset=[rel#6:Subset#1.NONE.[0, 1].any], id=[$0], name=[$1]): rowcount = 15.0, cumulative cost = {15.0 rows, 30.0 cpu, 0.0 io}, id = 5
  EnumerableTableScan(subset=[rel#4:Subset#0.ENUMERABLE.[0, 1].any], table=[[default, MYTABLE]]): rowcount = 15.0, cumulative cost = {15.0 rows, 16.0 cpu, 0.0 io}, id = 2
Sets:
Set#0, type: RecordType(INTEGER id, VARCHAR name)
  rel#4:Subset#0.ENUMERABLE.[0, 1].any, best=rel#2, importance=0.9
    rel#2:EnumerableTableScan.ENUMERABLE.[[0, 1]].any(table=[default, MYTABLE]), rowcount=15.0, cumulative cost={15.0 rows, 16.0 cpu, 0.0 io}
    rel#9:EnumerableProject.ENUMERABLE.[[0, 1]].any(input=rel#4:Subset#0.ENUMERABLE.[0,
[jira] [Created] (CALCITE-2044) Tweak cost of BindableTableScan to make sure Project is pushed through Aggregate
Luis Fernando Kauer created CALCITE-2044: Summary: Tweak cost of BindableTableScan to make sure Project is pushed through Aggregate Key: CALCITE-2044 URL: https://issues.apache.org/jira/browse/CALCITE-2044 Project: Calcite Issue Type: Bug Components: core Reporter: Luis Fernando Kauer Assignee: Julian Hyde Priority: Minor

Similar to [CALCITE-1876]. Projects are not pushed to BindableTableScan when using ProjectableFilterableTable with aggregate functions. The reason is that the cost of BindableTableScan does not use projects (and filters), so the planner chooses a plan with the Project node removed by ProjectRemoveRule. Tweaking the cost to use the number of used projects solved the problem. Any suggestion on a cost formula that takes both projects and filters into account?

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
Re: ProjectableFilterableTable.scan() and expensive columns
Alexey,

The example of computeSelfCost multiplies the cost by projects.size() / identity().size(), where projects.size() is the number of columns to be read (used) and identity().size() is the total number of columns of the table. So the cost is reduced if fewer columns need to be read instead of reading them all. You could use projects.size() / table.getRowType().getFieldCount() instead, but it's the same thing. Or you could use something more complex, like DruidQuery.computeSelfCost. If you plan on using ProjectableFilterableTable then you'll have to fix BindableTableScan.computeSelfCost, because that's what is used to calculate the cost, and currently it does not use the projects size. But you can use TranslatableTable just like in the CSV adapter. The problem of the used projects not being pushed when using some aggregate functions is also solved by implementing computeSelfCost in CsvTableScan:

@Override public RelOptCost computeSelfCost(RelOptPlanner planner, RelMetadataQuery mq) {
  // Cast to double so integer division doesn't truncate the ratio to zero.
  return super.computeSelfCost(planner, mq)
      .multiplyBy((double) fields.length / table.getRowType().getFieldCount());
}

And by using TranslatableTable you have full control over how to compute the cost to tailor your needs, without having to change anything in Calcite, just in your adapter. If you need to push filters down to your table, you'll need to implement that through rules. This can be more advanced, because you'll need to learn how to create rules, but it will be worth it, because you will understand Calcite better and be able to solve other problems by yourself.

On Monday, November 6, 2017, 12:09:06 BRST, Alexey Roytman <alexey.royt...@oracle.com> wrote:

Hello, Julian. The changes that belong to Calcite itself -- definitely, I will contribute them. For now I have only javadoc-related changes in a pull request. I understand that my approach shall be flexible and universal enough, and aligned with Calcite's ideology (which I'm not acquainted with). So, for now my only idea is to have an interface ProjectableFilterableTableScanCostEstimator that has a:

@Override public RelOptCost computeScanCost(RelOptPlanner planner, RelMetadataQuery mq, final List list, final int[] projected);

And at some moment during the VolcanoPlanner.findBestExp() flow, we need to check whether (table.table instanceof ProjectableFilterableTableScanCostEstimator) and if yes, then multiply by computeScanCost(). Maybe the right place is somewhere near BindableTableScan.computeSelfCost(), but I don't remember whether I have access to table.table there... Regarding your approach (https://issues.apache.org/jira/browse/CALCITE-1933) of ExpressionComplexity -- I don't get where I shall have an instance of this, and how I push such a beast to a ProjectableFilterableTable instance. I.e., I don't understand the idea of data composition here...

- Alexey.

On 11/01/2017 10:37 PM, Julian Hyde wrote:
> Alexey,
>
> Do you intend to contribute your changes back to Calcite? If not, feel free to disregard the following (but also downgrade the amount of free advice that you expect from this list; we all have other priorities).
>
> If you do intend to contribute back, you should keep in mind that some changes will be more acceptable than others. It's not sufficient that the changes solve your problem; your changes must not cause problems for other people.
>
> For instance, adding a field to SqlTypeName is probably not good because it doesn't allow people to customize the cost. Also, when extending a class or interface by adding methods and fields, ask yourself whether it was intended to be extended.
>
> The absolute most useful code you can contribute is a test case that covers all of your requirements and is simple for everyone else to run. Then we can safely refactor your implementation and know that you are still getting what you need.
>
> Are you aware of https://issues.apache.org/jira/browse/CALCITE-1933 ? There I suggested adding a new kind of metadata called ExpressionComplexity. You could write your own handler that takes into account SqlTypeName or whatever you like.
>
> Julian
>
>> On Oct 31, 2017, at 10:13 AM, Luis Fernando Kauer <lfka...@yahoo.com.br.INVALID> wrote:
>>
>> Hi,
>> I
Re: ProjectableFilterableTable.scan() and expensive columns
Hi, I'm also a newbie and I'm just trying to help with what I've learned so far. It's all about the rules. I strongly recommend taking a look into: https://calcite.apache.org/docs/howto.html#tracing Especially the part about setting the logger level to DEBUG (a log4j sketch follows this message). By doing this you can check which rules are being fired, the costs and the chosen plan. As I mentioned in my last reply, the problem seems to be that BindableTableScan.computeSelfCost does not use the information about used projects and filters to compute the cost of the BindableTableScan. Many rules are applied that convert the nodes, and at the end the plan with the lowest cost is chosen. Among the generated plans are plans that push the projects to the BindableTableScan and others that do not. The total cost of the plan that pushes the projects and the filter is probably the same as or higher than the cost of the plan that pushes only the filter and removes the projects (through AggregateProjectMergeRule), and the planner ends up choosing the latter. So maybe just fixing computeSelfCost solves the problem. Just for the purpose of testing this, I changed computeSelfCost to:

@Override public RelOptCost computeSelfCost(RelOptPlanner planner, RelMetadataQuery mq) {
  return super.computeSelfCost(planner, mq)
      .multiplyBy((0.01 + projects.size()) / identity().size());
}

And only by doing this, the planner chose a plan that pushes the projects and filters to BindableTableScan when used with aggregation. You can also override getStatistic() in your implementation of ProjectableFilterableTable to provide some statistics to be used in computing costs. Calcite also allows other ways to provide information and statistics, but that's beyond my knowledge.

Regards,
Luis Fernando

On Tuesday, October 31, 2017, 13:38:05 BRST, Alexey Roytman <alexey.royt...@oracle.com> wrote:

Sorry for the delay, Luis Fernando. Please see below, as there are a number of answers.

On 10/26/2017 09:37 PM, Luis Fernando Kauer wrote:
> I'm sorry, but I have no idea what you are talking about.
> The Cassandra adapter has code to translate the plan to run the query in the Cassandra server.
> If you are only interested in querying CSV files, I don't see how copying that code without understanding it will help you.

[Alexey] I need neither Cassandra nor the CSV adapter. Cassandra was mentioned by Julian, so I started investigating it. The CSV files were used because this way I can create a working test to share with the community to ask questions.

> First of all, you need to decide whether you will use ProjectableFilterableTable or TranslatableTable.

[Alexey] I would like to use ProjectableFilterableTable, but it does a poor job for me: starting from a certain level of nesting, it wants all projections. And for my task that's too heavy a query: my numeric columns are remotely calculated, with a different (and unpredictable) amount of work for each column.

[Alexey] So, as Julian mentioned Cassandra's interface, I started with it. There was a complication: it has both the Translatable and the Queryable interfaces. For the Queryable interface, the RexNodes are translated to List (all these translateBinary2() functions etc.) and passed via a reflection call to the query. But for me, at the query level, I need the RexNodes themselves, and I don't want to parse these List back!

[Alexey] So, I started again, but with the Druid adapter, which uses only the Translatable interface.

> You must try to understand how the rules work and how to check which rules are being fired and which ones are being chosen.

[Alexey] I do try. And when I reach a certain understanding, then obviously I won't ask such newbie questions as I ask for now :-))

> Did you follow the tutorial for creating the CSV adapter? It creates a rule to push the used projects to the table scan. That is a great start.

[Alexey] I did. But even in flavor=translatable it is very simplistic, and did not do the job of having both filters and projections at the lowest level...

> It's a good idea to take a look at the built-in rules available in Calcite too.
> You should take a look into FilterTableScanRule and ProjectTableScanRule, which are the rules that push the projects and filters used with ProjectableFilterableTable into a BindableTableScan, and the other rules in Bindables.java.

[Alexey] I totally agree with you. But when I look at the code there, I understand even less than now. This will improve with time, but for now this is what I have...

> The rules work fine when there is no aggregate function, pushing both filter and projects into BindableTableScan. The problem seems to be with AggregateProjectMergeRule which removes the Project from the plan.
> If you remove the filter from your test cases you'll see that the projects are pushed to the BindableTableScan.
> I was able to simulate your problem
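Following up on the tracing tip at the top of the reply above: enabling the planner logger is typically a one-line logging-config change. A hedged sketch of a log4j.properties file; the appender setup is an assumption of a plain console logger, while the logger name is the one Calcite's planner traces through:

# Hedged sketch (log4j.properties); appender names are placeholders.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p [%t] %c - %m%n

# Planner tracing: logs rule firings, costs and the chosen plan.
log4j.logger.org.apache.calcite.plan.RelOptPlanner=DEBUG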
Re: ProjectableFilterableTable.scan() and expensive columns
BindableTableScan.computeSelfCost does not use, in computing the cost, whether there are filters or how many projects are used; it just multiplies the cost by 0.01. Just changing the cost to take into account the number of projects over the identity (all projects) makes the planner choose the plan that pushes projects and filter to BindableTableScan, because the overall cost is lower than removing the Project by AggregateProjectMergeRule. Any suggestion or example of a good way to compute the cost taking into account the projects and filters?

On Thursday, October 26, 2017, 16:38:14 BRST, Luis Fernando Kauer <lfka...@yahoo.com.br.INVALID> wrote:

I'm sorry, but I have no idea what you are talking about. The Cassandra adapter has code to translate the plan to run the query in the Cassandra server. If you are only interested in querying CSV files, I don't see how copying that code without understanding it will help you. First of all, you need to decide whether you will use ProjectableFilterableTable or TranslatableTable. You must try to understand how the rules work and how to check which rules are being fired and which ones are being chosen. Did you follow the tutorial for creating the CSV adapter? It creates a rule to push the used projects to the table scan. That is a great start. It's a good idea to take a look at the built-in rules available in Calcite too. You should take a look into FilterTableScanRule and ProjectTableScanRule, which are the rules that push the projects and filters used with ProjectableFilterableTable into a BindableTableScan, and the other rules in Bindables.java.

The rules work fine when there is no aggregate function, pushing both filter and projects into BindableTableScan. The problem seems to be with AggregateProjectMergeRule, which removes the Project from the plan. If you remove the filter from your test cases you'll see that the projects are pushed to the BindableTableScan. I was able to simulate your problem using ScannableTableTest.testProjectableFilterable2WithProject, changing the query into "select \"k\", count(*) from (select \"k\",\"j\" from \"s\".\"beatles\" where \"i\" = 4) x group by \"k\"".

The plan:
LogicalAggregate(group=[{0}], EXPR$1=[COUNT()])
  LogicalProject(k=[$1])
    LogicalFilter(condition=[=($0, 4)])
      LogicalProject(i=[$0], k=[$2])
        LogicalTableScan(table=[[s, beatles]])

Physical plan:
EnumerableAggregate(group=[{2}], EXPR$1=[COUNT()]): rowcount = 10.0, cumulative cost = {61.25 rows, 50.0 cpu, 0.0 io}, id = 112
  EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 110
    BindableTableScan(table=[[s, beatles]], filters=[[=($0, 4)]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 62

If I disable AggregateProjectMergeRule, the physical plan is:
EnumerableAggregate(group=[{0}], EXPR$1=[COUNT()]): rowcount = 10.0, cumulative cost = {61.25 rows, 50.0 cpu, 0.0 io}, id = 102
  EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 100
    BindableTableScan(table=[[s, beatles]], filters=[[=($0, 4)]], projects=[[2]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 78

Regards,
Luis Fernando

On Thursday, October 26, 2017, 13:19:46 BRST, Alexey Roytman <alexey.royt...@oracle.com> wrote:

Thanks for the hints. I've tried to use [i.e. copy-pasted a lot of] Cassandra*.java for my CSV-files example. It's really too wordy! So much code I need to understand later!..

But what bothers me most for now is the fact that I just cannot pass List to [my modification of] CassandraTable.query(); I need to convert it to some string form within List using CassandraFilter.Translator, and then, when passed to [my modification of] CassandraTable.query(), I need to parse these List back... Is there a way to eliminate this back-and-forth serialization-deserialization?

- Alexey.

(P.S. Sorry for not keeping the email thread for now...)

Julian Hyde wrote:
> By "write a rule" I mean write a class that extends RelOptRule. An example is CassandraRules.CassandraFilterRule. ProjectableFilterableTable was "only" designed for the case that occurs 80% of the time but requires 20% of the functionality. Rules run in a richer environment so have more power and flexibility.
Re: ProjectableFilterableTable.scan() and expensive columns
I'm sorry, but I have no idea what you are talking about. The Cassandra adapter has code to translate the plan to run the query in the Cassandra server. If you are only interested in querying CSV files, I don't see how copying that code without understanding it will help you. First of all, you need to decide whether you will use ProjectableFilterableTable or TranslatableTable. You must try to understand how the rules work and how to check which rules are being fired and which ones are being chosen. Did you follow the tutorial for creating the CSV adapter? It creates a rule to push the used projects to the table scan. That is a great start. It's a good idea to take a look at the built-in rules available in Calcite too. You should take a look into FilterTableScanRule and ProjectTableScanRule, which are the rules that push the projects and filters used with ProjectableFilterableTable into a BindableTableScan, and the other rules in Bindables.java.

The rules work fine when there is no aggregate function, pushing both filter and projects into BindableTableScan. The problem seems to be with AggregateProjectMergeRule, which removes the Project from the plan. If you remove the filter from your test cases you'll see that the projects are pushed to the BindableTableScan. I was able to simulate your problem using ScannableTableTest.testProjectableFilterable2WithProject, changing the query into "select \"k\", count(*) from (select \"k\",\"j\" from \"s\".\"beatles\" where \"i\" = 4) x group by \"k\"".

The plan:
LogicalAggregate(group=[{0}], EXPR$1=[COUNT()])
  LogicalProject(k=[$1])
    LogicalFilter(condition=[=($0, 4)])
      LogicalProject(i=[$0], k=[$2])
        LogicalTableScan(table=[[s, beatles]])

Physical plan:
EnumerableAggregate(group=[{2}], EXPR$1=[COUNT()]): rowcount = 10.0, cumulative cost = {61.25 rows, 50.0 cpu, 0.0 io}, id = 112
  EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 110
    BindableTableScan(table=[[s, beatles]], filters=[[=($0, 4)]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 62

If I disable AggregateProjectMergeRule, the physical plan is:
EnumerableAggregate(group=[{0}], EXPR$1=[COUNT()]): rowcount = 10.0, cumulative cost = {61.25 rows, 50.0 cpu, 0.0 io}, id = 102
  EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 100
    BindableTableScan(table=[[s, beatles]], filters=[[=($0, 4)]], projects=[[2]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 78

Regards,
Luis Fernando

On Thursday, October 26, 2017, 13:19:46 BRST, Alexey Roytman wrote:

Thanks for the hints. I've tried to use [i.e. copy-pasted a lot of] Cassandra*.java for my CSV-files example. It's really too wordy! So much code I need to understand later!.. But what bothers me most for now is the fact that I just cannot pass List to [my modification of] CassandraTable.query(); I need to convert it to some string form within List using CassandraFilter.Translator, and then, when passed to [my modification of] CassandraTable.query(), I need to parse these List back... Is there a way to eliminate this back-and-forth serialization-deserialization?

- Alexey.

(P.S. Sorry for not keeping the email thread for now...)

Julian Hyde wrote:
> By "write a rule" I mean write a class that extends RelOptRule. An example is CassandraRules.CassandraFilterRule. ProjectableFilterableTable was "only" designed for the case that occurs 80% of the time but requires 20% of the functionality. Rules run in a richer environment so have more power and flexibility.
Re: ProjectableFilterableTable.scan() and expensive columns
Hi, Can you give us some examples of the queries you tested? Include the query plan Calcite generated (use EXPLAIN PLAN FOR your query). Currently, aggregates with no column reference, like count(*), generate a plan that scans all projects when using ProjectableFilterableTable. I'm not sure if there is a Jira for that already. The other option is to use TranslatableTable, but be aware that you'll have to implement rules to push the necessary projects to the table scan, unlike ProjectableFilterableTable, which has many rules built into Calcite for that. See also: [CALCITE-1876] Create a rule to push the projections used in aggregate functions - ASF JIRA. The Druid adapter calls Druid to execute the table scan. In the CSV adapter it is executed by Calcite using an Enumerator. Always check the query plan first and create rules to optimize it the way you want. Enable tracing if you want to check which rules were applied. https://calcite.apache.org/docs/howto.html#tracing
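For reference, a minimal way to capture that plan over JDBC. A hedged sketch: the model file name and table name are placeholders, not anything from the thread above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection =
             DriverManager.getConnection("jdbc:calcite:model=model.json");
         Statement statement = connection.createStatement();
         // Prefix any query with EXPLAIN PLAN FOR to get the chosen plan.
         ResultSet rs = statement.executeQuery(
             "EXPLAIN PLAN FOR select count(*) from \"s\".\"mytable\"")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}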
Re: Re: querying elasticsearch fields
Your problem, again, is that the names are case sensitive. Try using double quotes around column and table names to make sure they are interpreted correctly, or use them all in uppercase, both in the json model and in the query. You can also set a different lex property for the connection that is more flexible, like MYSQL_ANSI, or just set the caseSensitive property to false. http://calcite.apache.org/docs/adapter.html
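A hedged sketch of those connection properties; the model file name is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class LexPropertiesSketch {
  public static Connection open() throws Exception {
    Properties info = new Properties();
    info.setProperty("lex", "MYSQL_ANSI");       // relaxed lexical rules
    // or, keeping the default lex but ignoring identifier case:
    // info.setProperty("caseSensitive", "false");
    return DriverManager.getConnection("jdbc:calcite:model=model.json", info);
  }
}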
querying elasticsearch fields
Did you take a look at the test cases for some examples of use? https://github.com/apache/calcite/blob/master/elasticsearch5/src/test/java/org/apache/calcite/test/Elasticsearch5AdapterIT.java What have you tried?
Re: Support of query for complex/nested data in calcite
I created: [CALCITE-1999] UNNEST generates a NPE when the parameter is null - ASF JIRA, and [CALCITE-2000] UNNEST a collection that has a field with nested data generates an Exception - ASF JIRA.

The syntax you posted as the equivalent standard query did not work (wrong syntax):

select d.name, e.empid, e.salary from hr.depts d left cross join unnest(d.employees) e

What I managed to get working for a left join is:

select d.name, e.empid, e.salary from hr.depts d left join unnest(d.employees) e on true

The syntax below also worked, but it uses an inner join:

select d.name, e.empid, e.salary from hr.depts d cross join unnest(d.employees) e

Regards,
Luis Fernando

On Wednesday, October 4, 2017, 17:54:23 BRT, Julian Hyde <jhyde.apa...@gmail.com> wrote:

Regarding nulls and FlatList: can you please log a bug.

Regarding the Impala query

select d.name, e.empid, e.salary from hr.depts d left join d.employees e

the equivalent standard query is

select d.name, e.empid, e.salary from hr.depts d left cross join unnest(d.employees) e

I can see how Impala's syntactic sugar might be helpful, although frankly it's not much shorter. Feel free to log a JIRA case.

Julian

> On Oct 4, 2017, at 12:07 PM, Luis Fernando Kauer <lfka...@yahoo.com.br.INVALID> wrote:
>
> Hi Joel.
>
> Nice to hear someone else is also concerned about this. I'm more concerned about nested structures than unnesting simple arrays. What I found so far:
> - the nested collection must exist (can't be null) or it generates an NPE. In the "hr" schema, try creating a department with the employees collection as null. Unfortunately, this is a very common issue for me.
> - the join type is "inner" by default, but since I have many empty/null nested collections, often I need "left join" instead.
> - a nested structure collection from a nested structure collection (nested more than one level) always generates errors when trying to unnest the first level. This is because Calcite uses FlatLists internally and a nested collection field usually contains an array or List, which are not Comparable, so when it tries to add the value of the column with nested data into the FlatList it generates an error.
> - unnest reads the value of the column passed as parameter with the nested data for the current record and iterates through it. So the column must return the whole nested collection with all nested fields loaded. But it would be nice if somehow it could select only the nested fields that are used in the query.
> - It may not be the standard, but I like Impala's syntax where you can simply query: select d.name, e.empid, e.salary from hr.depts d left join d.employees e
> I think it is very simple to use and to understand.
>
> Best regards,
>
> Luis Fernando
>
> On Wednesday, October 4, 2017, 11:26:38 BRT, Joel Pfaff <joel.pf...@gmail.com> wrote:
>
> Hello,
>
> Sorry, I was not registered in the mailing list when Luis Fernando Kauer sent the mail, so I cannot reply directly to it. But we share a similar concern with the ability to implement more precise/efficient queries on complex documents using SQL.
>
> Currently in our system, the management of complex types requires using UNNEST, and there is no way to express a transformation inside a nested document without flattening it first. And this flattening operation can be expensive if it has to be followed by a GROUP BY clause to recreate a similar document but with some nested data being filtered and/or aggregated (min/max/avg/concatenation).
>
> For similar purposes, Google Big Query proposes UNNEST and ARRAY keywords (see https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays) and by allowing implicit joins between simple columns and nested columns, it allows computations to stay local to the document.
>
> Inside Impala, it is possible to create correlated sub-queries that apply on a repetition inside a document (see https://www.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html). There is another kind of syntax, different from the one from Big Query, but it works.
>
> In Presto, it is possible to apply a lambda function to a map/array to transform the structure and apply a filter on it. So you have filter and map_filter functions to filter arrays and maps respectively (see https://prestodb.io/docs/current/functions/lambda.html#filter). So it is yet another syntax.
>
> From Calcite's code, I can see that a good part of the code related to the parsing is already here, there is already support for MAP/ARRAY/MULTISET, and I am wondering what would be missing/different from Google Big Query s
[jira] [Created] (CALCITE-2000) UNNEST a collection that has a field with nested data generates an Exception
Luis Fernando Kauer created CALCITE-2000: Summary: UNNEST a collection that has a field with nested data generates an Exception Key: CALCITE-2000 URL: https://issues.apache.org/jira/browse/CALCITE-2000 Project: Calcite Issue Type: Bug Reporter: Luis Fernando Kauer Assignee: Julian Hyde

When unnesting a column that has another level of nesting it generates the following exception:

{noformat}
Caused by: java.lang.ArrayStoreException
  at java.lang.System.arraycopy(Native Method)
  at org.apache.calcite.runtime.SqlFunctions$ProductComparableListEnumerator.current(SqlFunctions.java:2312)
  at org.apache.calcite.runtime.SqlFunctions$ProductComparableListEnumerator.current(SqlFunctions.java:1)
  at org.apache.calcite.linq4j.EnumerableDefaults$17$1.current(EnumerableDefaults.java:1968)
  at org.apache.calcite.linq4j.EnumerableDefaults$11$1.moveNext(EnumerableDefaults.java:1225)
  at Baz$4$1.moveNext(Unknown Source)
  at org.apache.calcite.linq4j.Linq4j$EnumeratorIterator.(Linq4j.java:680)
  at org.apache.calcite.linq4j.Linq4j.enumeratorIterator(Linq4j.java:98)
  at org.apache.calcite.linq4j.AbstractEnumerable.iterator(AbstractEnumerable.java:33)
  at org.apache.calcite.avatica.MetaImpl.createCursor(MetaImpl.java:90)
  at org.apache.calcite.avatica.AvaticaResultSet.execute(AvaticaResultSet.java:206)
  at org.apache.calcite.jdbc.CalciteResultSet.execute(CalciteResultSet.java:67)
  at org.apache.calcite.jdbc.CalciteResultSet.execute(CalciteResultSet.java:1)
  at org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:630)
  at org.apache.calcite.jdbc.CalciteMetaImpl.prepareAndExecute(CalciteMetaImpl.java:607)
  at org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:638)
  at org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:149)
  ... 29 more
{noformat}

The problem is that Calcite uses FlatLists internally that expect items to implement Comparable. However, a column with a nested collection will have an array or a List as the column value, which do not implement Comparable, so it generates the Exception. To test this in JdbcTest, I created a field in Employee with a collection. For simplicity, I created a field called dependents in the Employee class:

{code:java}
public static class HrSchema {
  @Override public String toString() {
    return "HrSchema";
  }

  public final Dependent[] dependents = {
    new Dependent(10, "Michael"), new Dependent(10, "Jane"),
  };
  public final Employee[] emps = {
    new Employee(100, 10, "Bill", 1, 1000, Arrays.asList(dependents)),
    new Employee(200, 20, "Eric", 8000, 500, Collections.emptyList()),
    new Employee(150, 10, "Sebastian", 7000, null, Collections.emptyList()),
    new Employee(110, 10, "Theodore", 11500, 250, Collections.emptyList()),
  };
  public final Department[] depts = {
    new Department(10, "Sales", Arrays.asList(emps[0], emps[2]), new Location(-122, 38)),
    new Department(30, "Marketing", Collections.emptyList(), new Location(0, 52)),
    new Department(40, "HR", Collections.singletonList(emps[1]), null),
  };
  public final Dependent[] locations = {
    new Dependent(10, "San Francisco"), new Dependent(20, "San Diego"),
  };

  public QueryableTable foo(int count) {
    return Smalls.generateStrings(count);
  }

  public TranslatableTable view(String s) {
    return Smalls.view(s);
  }
}

public static class Employee {
  public final int empid;
  public final int deptno;
  public final String name;
  public final float salary;
  public final Integer commission;
  public final List<Dependent> dependents;

  public Employee(int empid, int deptno, String name, float salary, Integer commission) {
    this(empid, deptno, name, salary, commission, Collections.emptyList());
  }

  public Employee(int empid, int deptno, String name, float salary, Integer commission,
      List<Dependent> dependents) {
    this.empid = empid;
    this.deptno = deptno;
    this.name = name;
    this.salary = salary;
    this.commission = commission;
    this.dependents = dependents;
  }

  @Override public String toString() {
    return "Employee [empid: " + empid + ", deptno: " + deptno + ", name: " + name + "]";
  }

  @Override public boolean equals(Object obj) {
    return obj == this
        || obj instanceof Employee && empid == ((Employee) obj).empid;
  }
}
{code}

Just running the test case JdbcTest.testUnnestArrayColumn generates the exception.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (CALCITE-1999) UNNEST generates a NPE when the parameter is null
Luis Fernando Kauer created CALCITE-1999: Summary: UNNEST generates a NPE when the parameter is null Key: CALCITE-1999 URL: https://issues.apache.org/jira/browse/CALCITE-1999 Project: Calcite Issue Type: Bug Reporter: Luis Fernando Kauer Assignee: Julian Hyde

It happens when trying to unnest a column with a nested collection when the column is null. To reproduce it, I changed JdbcTest, adding a Department with a null collection of employees to HrSchema:

{code:java}
new Department(50, "NULL", null, null),
{code}

Just running the test case JdbcTest.testUnnestArrayColumn() results in the NPE. The NPE also happens when running the following query:

{code:java}
select * from unnest(array(select * from \"hr\".\"depts\" where \"deptno\"=-1)) a
{code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
Re: Support of query for complex/nested data in calcite
Hi Joel.

Nice to hear someone else is also concerned about this. I'm more concerned about nested structures than unnesting simple arrays. What I found so far:
- the nested collection must exist (can't be null) or it generates an NPE. In the "hr" schema, try creating a department with the employees collection as null. Unfortunately, this is a very common issue for me.
- the join type is "inner" by default, but since I have many empty/null nested collections, often I need "left join" instead.
- a nested structure collection from a nested structure collection (nested more than one level) always generates errors when trying to unnest the first level. This is because Calcite uses FlatLists internally and a nested collection field usually contains an array or List, which are not Comparable, so when it tries to add the value of the column with nested data into the FlatList it generates an error.
- unnest reads the value of the column passed as parameter with the nested data for the current record and iterates through it. So the column must return the whole nested collection with all nested fields loaded. But it would be nice if somehow it could select only the nested fields that are used in the query.
- It may not be the standard, but I like Impala's syntax where you can simply query: select d.name, e.empid, e.salary from hr.depts d left join d.employees e
I think it is very simple to use and to understand.

Best regards,

Luis Fernando

On Wednesday, October 4, 2017, 11:26:38 BRT, Joel Pfaff <joel.pf...@gmail.com> wrote:

Hello,

Sorry, I was not registered in the mailing list when Luis Fernando Kauer sent the mail, so I cannot reply directly to it. But we share a similar concern with the ability to implement more precise/efficient queries on complex documents using SQL.

Currently in our system, the management of complex types requires using UNNEST, and there is no way to express a transformation inside a nested document without flattening it first. And this flattening operation can be expensive if it has to be followed by a GROUP BY clause to recreate a similar document but with some nested data being filtered and/or aggregated (min/max/avg/concatenation).

For similar purposes, Google Big Query proposes UNNEST and ARRAY keywords (see https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays) and by allowing implicit joins between simple columns and nested columns, it allows computations to stay local to the document.

Inside Impala, it is possible to create correlated sub-queries that apply on a repetition inside a document (see https://www.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html). There is another kind of syntax, different from the one from Big Query, but it works.

In Presto, it is possible to apply a lambda function to a map/array to transform the structure and apply a filter on it. So you have filter and map_filter functions to filter arrays and maps respectively (see https://prestodb.io/docs/current/functions/lambda.html#filter). So it is yet another syntax.

From Calcite's code, I can see that a good part of the code related to the parsing is already here, there is already support for MAP/ARRAY/MULTISET, and I am wondering what would be missing/different from Google Big Query syntax?

So I started to play, and I managed to run such a query end to end using a test in JdbcTest:

@Test public void testUnnestArrayColumnWithFilter() {
  CalciteAssert.hr()
      .query("select d.\"deptno\",\n"
          + " array(select x.\"empid\" * 2 from UNNEST(d.\"employees\") AS x WHERE x.\"empid\" > 2) AS doubled_empno"
          + " from \"hr\".\"depts\" as d")
      .returnsUnordered(
          "deptno=10; DOUBLED_EMPNO=[{200}, {300}]",
          "deptno=30; DOUBLED_EMPNO=[]",
          "deptno=40; DOUBLED_EMPNO=[{400}]");
}

SQL parsing, validation and execution appeared to work just fine, and that looks really nice. I then used explainContains() on that query to get access to the plan, and I got:

PLAN=
EnumerableCalc(expr#0..4=[{inputs}], deptno=[$t0], DOUBLED_EMPNO=[$t4])
  EnumerableCorrelate(correlation=[$cor0], joinType=[inner], requiredColumns=[{2}])
    EnumerableTableScan(table=[[hr, depts]])
    EnumerableCollect(field=[EXPR$0])
      EnumerableCalc(expr#0..4=[{inputs}], expr#5=[2], expr#6=[*($t0, $t5)], expr#7=[>($t0, $t5)], EXPR$0=[$t6], $condition=[$t7])
        EnumerableUncollect
          EnumerableCalc(expr#0=[{inputs}], expr#1=[$cor0], expr#2=[$t1.employees], employees=[$t2])
            EnumerableValues(tuples=[[{ 0 }]])

I wanted to check the plan generated by Calcite in these queries, to validate that the aggregations stay local to the document, and do not trigger a full flatten/group by, in case it is not needed. And I am ho
Query complex/nested data
Hi, We have a lot of data in complex formats, similar to a json file, with nested collections, and I would like to know the best approach to implement an adapter to read it efficiently. Users will probably end up joining the top-level struct with the nested collections of structs when they need the nested data, so this is something I want to make sure works. But I would like the adapter to be able to select the used columns to be read from the nested collections, because they can be very large. It seems the function UNNEST is the way to make a nested collection available as a table in Calcite, but the whole collection is read with all the columns, so it is not efficient. Probably I'll need rules to push the projects used from the nested collections down to the TableScan. It would be even better if the query didn't need to use a function like UNNEST, but simply allowed a join between the table and the nested collection field. Is there anything available for this? Any suggestions?

Best regards,
Luis Fernando
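For concreteness, this is the kind of query UNNEST enables today. A sketch against the "hr"."depts"/"employees" schema used elsewhere in these threads; the join variants come from the follow-up messages above:

-- Expose the nested "employees" collection of each department as a table.
-- CROSS JOIN UNNEST gives inner-join semantics; a LEFT JOIN ... ON TRUE
-- variant keeps departments whose collection is empty.
SELECT d."name", e."empid", e."salary"
FROM "hr"."depts" AS d
CROSS JOIN UNNEST(d."employees") AS e;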
Re: Two newbie questions regarding Calcite Cassandra Adapter
Hi,

I'll try to answer your questions since nobody else did, but I'm still fairly new to Calcite and still learning it. I'll answer your questions in reverse order, because the answer to the second question is important for answering the first.

2. ConverterRules: how Calcite knows which LogicalProject conversion rule to apply

Calcite has the concept of a "Convention", which determines the calling convention. Relational expressions start with Convention.NONE, and each adapter sets its own convention when implementing TranslatableTable.toRel. CassandraRel implements a Convention for Cassandra; MongoRel implements a Convention for MongoDB. CassandraTable.toRel returns a CassandraTableScan with a traitSet of CassandraRel.CONVENTION. If you check the rules, there are basically 2 types:
- Normal rule (extends RelOptRule): rule defined by "operands"
- Converter rule (extends ConverterRule): defined by a class and a traitSet (convention)

The converter rule uses the convention of the input to match the rule. CassandraProjectRule is a ConverterRule, so it matches only when the input has CassandraRel.CONVENTION. So it won't match a LogicalProject with MongoRel.CONVENTION. The adapters also have a rule to convert between their convention and EnumerableConvention.INSTANCE (CassandraToEnumerableConverter). So the convention is the key to knowing which conversion rule to match and apply. (A sketch of this converter-rule pattern follows this message.)

1. Different plans with different "Limit" clause

Looking into the calcite-cassandra adapter, I noticed that some of its rules are not converter rules, and they try to convert/apply to nodes already converted to Enumerable. Take a look at CassandraLimitRule:

super(operand(EnumerableLimit.class, operand(CassandraToEnumerableConverter.class, any())), "CassandraLimitRule");

The problem with this approach (IMHO) is that it depends heavily on the overall costs for the CassandraLimit to be matched and chosen, but this kind of node should always be pushed. That's what is happening when the limit is 1 in your query. LogicalSort is used for sorting, but also for limit (fetch) and offset. The Cassandra adapter has a conversion rule for Sort, but not when offset or limit is set. I don't know the details of why this approach was chosen; maybe it was because the jdbc adapter is the reference and it does not implement limit and offset correctly. But I think CassandraSortRule could use the limit (fetch) and offset information and set it in the implement method, with no need for CassandraLimit. Trying to solve this by only changing the costs is hard, because the planner generates many different plans and you'd have to make sure the costs always come out smaller while keeping the enumerable nodes.

Hope it helps.

Best regards,

Luis Fernando

Em Sexta-feira, 1 de Setembro de 2017 1:13, Junwei Li escreveu:

1. I happen to notice that the value in limit could change the query plan. For example, if I do explain plan for "select \"tweet_id\" from \"userline\" limit 1", the plan is:

PLAN=EnumerableCalc(expr#0..2=[{inputs}], tweet_id=[$t2])
  EnumerableLimit(fetch=[1])
    CassandraToEnumerableConverter
      CassandraTableScan(table=[[twissandra, userline]])

But if the limit value is no smaller than 8, such as for "select \"tweet_id\" from \"userline\" limit 8", the plan is:

PLAN=CassandraToEnumerableConverter
  CassandraLimit(fetch=[8])
    CassandraProject(tweet_id=[$2])
      CassandraTableScan(table=[[twissandra, userline]])

Since computeSelfCost() depends on the row count, I guess that is why the query plans are different.
But I want to make sure the query plan stays the same as the second one even when the limit value is smaller than 8. How should I tweak computeSelfCost()?

2. If I do an inner join query that spans Cassandra and another database, such as MongoDB, an example for illustration purposes:

"with\n"
+ "tweet as (\n"
+ "select \"tweet_id\" as id from \"twissandra\".\"userline\"\n"
+ "where \"username\" = '!PUBLIC!' order by \"time\"),\n"
+ "metrics as (\n"
+ "select \"id\" as id, \"metricName\", \"value\"\n"
+ "from \"mongodb\".METRIC\n"
+ "where \"metricName\" = 'cpu.usage.average'\n"
+ ")\n"
+ "select tweet.id, metrics.\"value\" from tweet\n"
+ "inner join metrics using(id)\n"
+ "where metrics.\"value\" > 70"

According to CassandraProjectRule, super(LogicalProject.class, "CassandraProjectRule"), it seems that as long as there is a LogicalProject.class, CassandraProjectRule will get triggered. Does this mean a LogicalProject that belongs to MongoDB will also get matched by CassandraProjectRule? But when I do explain plan, the generated plan does separate the projections correctly between Cassandra and MongoDB. How can Calcite tell which projection belongs to which schema? Thanks.
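To make the convention-based matching from the reply above concrete, here is a minimal sketch of the converter-rule pattern. MyRel, MyProject and MyProjectRule are hypothetical placeholder names (not Calcite classes), and the constructor follows the ConverterRule(Class, RelTrait, RelTrait, String) form the adapters used at the time:

import org.apache.calcite.plan.Convention;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.logical.LogicalProject;

// Hypothetical adapter rule: converts a LogicalProject into the adapter's
// own MyProject node in MyRel.CONVENTION.
public class MyProjectRule extends ConverterRule {
  public static final MyProjectRule INSTANCE = new MyProjectRule();

  private MyProjectRule() {
    super(LogicalProject.class, Convention.NONE, MyRel.CONVENTION,
        "MyProjectRule");
  }

  @Override public RelNode convert(RelNode rel) {
    final LogicalProject project = (LogicalProject) rel;
    final RelTraitSet traitSet = project.getTraitSet().replace(MyRel.CONVENTION);
    // Requesting the input in MyRel.CONVENTION can only be satisfied for
    // trees rooted in this adapter's tables, so alternatives produced for
    // another adapter's LogicalProject are never viable.
    return new MyProject(project.getCluster(), traitSet,
        convert(project.getInput(), MyRel.CONVENTION),
        project.getProjects(), project.getRowType());
  }
}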
BindableTableScan.computeSelfCost does not take projects or filters into account
At first I was using TranslatableTable for my adapters (which read files in different formats), but I realized that many rules apply only to ProjectableFilterableTable, so I started using it to take advantage of all these built-in rules. Restricting the projects that need to be scanned is very important, because it can greatly reduce the time needed to scan the table. Unfortunately, I'm having trouble making Calcite select a plan that pushes the used projects into the table scan for simple queries that have a filter. The plan selected has only the filters and no projects. What seems to be happening is that BindableFilterRule and BindableProjectRule are applied, generating different plans: some with only the filter, some with only the projects, and some with both. The planner has to decide which one to choose, but BindableTableScan.computeSelfCost just multiplies the cost by 0.01, without taking into account the restriction of projects or the existence of filters. Since the different plans with and without the projects and filters end up with the same cost, the planner chooses the first plan, which in my case has been a BindableTableScan with only the filters, generated by BindableFilterRule. The query is something like:

select c1, c2 from t where c3=10

Changing BindableTableScan.computeSelfCost to calculate different costs according to projects and filters made the planner choose correctly, but it would be nice to allow each adapter/table to define its own costs, since projects and filters can have different costs.
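A minimal sketch of the kind of cost adjustment described above, as a method on a BindableTableScan-like node that carries "projects" and "filters" fields; the discount factors are illustrative assumptions, not values from Calcite:

// Inside a TableScan subclass with ImmutableIntList projects and
// List<RexNode> filters fields: cheaper scans for fewer projected columns
// and for pushed filters, so the planner prefers plans that push both.
@Override public RelOptCost computeSelfCost(RelOptPlanner planner, RelMetadataQuery mq) {
  final double rowCount = mq.getRowCount(this);
  // Fraction of columns actually read, floored so "no columns" (count(*))
  // still has a nonzero cost.
  final double projectFactor = Math.max(0.1d,
      (double) projects.size() / table.getRowType().getFieldCount());
  // Any pushed filter makes the scan cheaper than a bare scan.
  final double filterFactor = filters.isEmpty() ? 1d : 0.5d;
  return planner.getCostFactory()
      .makeCost(rowCount * projectFactor * filterFactor, rowCount, 0);
}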
Re: Rule to remove right part of a left join when no project is used after the join
> I think you also need a check that the right-hand key is unique. For
> example, it's not valid to rewrite
>
>   select d.name from dept left join emp using (deptno)
>
> to
>
>   select d.name from dept
>
> because the second query gives fewer rows.

Yes, you are right. I forgot about these cases where it can increase the number of rows. My main use case for the rule is to simplify a query that uses a view with many joins to dimension tables. So how can I check if the right-hand key is unique (see the sketch after this message)? The jdbc-adapter doesn't seem to load any information about keys. You mentioned RelReferentialConstraint. How can I use it? I'll try to implement your other comments too. Thank you.

Luis Fernando

Em Sexta-feira, 28 de Julho de 2017 14:27, Julian Hyde <jh...@apache.org> escreveu:

That looks good. I think you also need a check that the right-hand key is unique. For example, it's not valid to rewrite

  select d.name from dept left join emp using (deptno)

to

  select d.name from dept

because the second query gives fewer rows.

I think there's some library code to ensure that a join only uses fields from a particular input. Often people choose to parameterize rules with classes. Your rule, for instance, could work on any sub-class of Project and Join, not just LogicalProject and LogicalJoin. I also strongly recommend that you add sufficient tests to RelOptRulesTest to prove that the rule handles the main cases, including cases, like the one above, where it would be wrong to fire the rule. If you intend to contribute this at some point, please file a JIRA case. You can attach a PR later.

Julian

On Fri, Jul 28, 2017 at 7:53 AM, Luis Fernando Kauer <lfka...@yahoo.com.br.invalid> wrote:
> I'm still learning Calcite and how to convert and transform the nodes, but I
> managed to create a rule to do this that seems to be working, and I would
> like your comments about flaws and improvements.
> It drops the right-hand side of a join when the join type is "left" and the
> project above the join includes only field references to the left input.
>
> public class LeftJoinRemoveRule extends RelOptRule {
>   public static final LeftJoinRemoveRule INSTANCE =
>       new LeftJoinRemoveRule();
>
>   //~ Constructors ---
>
>   /** Creates a LeftJoinRemoveRule. */
>   private LeftJoinRemoveRule() {
>     super(operand(LogicalProject.class,
>         operand(LogicalJoin.class, any())));
>   }
>
>   //~ Methods
>
>   public void onMatch(RelOptRuleCall call) {
>     LogicalProject project = call.rel(0);
>     LogicalJoin join = call.rel(1);
>     RelNode input = join.getInput(0);
>     int fieldCount = input.getRowType().getFieldCount();
>     if (join.getJoinType() != JoinRelType.LEFT) {
>       return;
>     }
>     for (RexNode rex : project.getProjects()) {
>       if (!(rex instanceof RexInputRef)) {
>         return;
>       }
>       RexInputRef ref = (RexInputRef) rex;
>       if (ref.getIndex() >= fieldCount) {
>         return;
>       }
>     }
>     call.transformTo(project.copy(project.getTraitSet(),
>         ImmutableList.of(input)));
>   }
> }
>
> Best regards,
>
> Luis Fernando
>
> Em Sexta-feira, 21 de Julho de 2017 22:21, Julian Hyde <jh...@apache.org>
> escreveu:
>
> As you're probably aware, it's only safe to drop a table if (a) none
> of its fields are used, (b) the join is to a unique key of that table,
> (c) there is a referential constraint to that table.
>
> Without conditions (b) and (c), the table could be
> converting each row on the left to zero or more than one rows, and
> therefore removing the join would affect the result.
>
> This kind of schema is common in data warehousing environments, so I
> created the Lattice construct.
> Tables in a lattice implicitly have
> many-to-one relationships. That makes it safe to transform a query on,
> say, (Sales JOIN Customer) to use a materialized view on ((Sales JOIN
> Customer) JOIN Product), because adding/removing the Product table
> will not affect the row count.
>
> With Lattices the referential constraints are implicit, but we didn't
> have the explicit referential constraints that made it safe to do this
> transform. Jesus added them (class RelReferentialConstraint) as part
> of https://issues.apache.org/jira/browse/CALCITE-1731.
> Maybe you can leverage them in a rule you write. It would match
> (Project (Join A B)) and convert to (Project A).
>
> I don't recall how to add rules. Some people use Frameworks or Planner.
>
> Julian
>
> On Fri, Jul 21, 2017 at 12:42 PM, Luis Fernando Kauer
> <lfka...@yahoo.com.br.inv
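Regarding the uniqueness question asked above, a minimal sketch of a check the rule could run before firing, using the metadata API (this is an assumption about approach, not code from the thread; areColumnsUnique returns null when uniqueness cannot be proven):

import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.core.JoinInfo;
import org.apache.calcite.rel.metadata.RelMetadataQuery;
import org.apache.calcite.util.ImmutableBitSet;

// Hypothetical helper for onMatch: only drop the right side of an equi-join
// if its join key is provably unique, so the left row count cannot change.
private static boolean rightKeyIsUnique(RelMetadataQuery mq, Join join) {
  JoinInfo joinInfo = join.analyzeCondition(); // extracts equi-join keys
  Boolean unique =
      mq.areColumnsUnique(join.getRight(), ImmutableBitSet.of(joinInfo.rightKeys));
  return unique != null && unique; // null means "unknown", so treat as unsafe
}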
Re: Rule to remove right part of a left join when no project is used after the join
I'm still learning Calcite and how to convert and transform the nodes, but I managed to create a rule to do this that seems to be working, and I would like your comments about flaws and improvements. It drops the right-hand side of a join when the join type is "left" and the project above the join includes only field references to the left input.

public class LeftJoinRemoveRule extends RelOptRule {
  public static final LeftJoinRemoveRule INSTANCE =
      new LeftJoinRemoveRule();

  //~ Constructors ---

  /** Creates a LeftJoinRemoveRule. */
  private LeftJoinRemoveRule() {
    super(operand(LogicalProject.class,
        operand(LogicalJoin.class, any())));
  }

  //~ Methods

  public void onMatch(RelOptRuleCall call) {
    LogicalProject project = call.rel(0);
    LogicalJoin join = call.rel(1);
    RelNode input = join.getInput(0);
    int fieldCount = input.getRowType().getFieldCount();
    if (join.getJoinType() != JoinRelType.LEFT) {
      return;
    }
    for (RexNode rex : project.getProjects()) {
      if (!(rex instanceof RexInputRef)) {
        return;
      }
      RexInputRef ref = (RexInputRef) rex;
      if (ref.getIndex() >= fieldCount) {
        return;
      }
    }
    call.transformTo(project.copy(project.getTraitSet(),
        ImmutableList.of(input)));
  }
}

Best regards,

Luis Fernando

Em Sexta-feira, 21 de Julho de 2017 22:21, Julian Hyde <jh...@apache.org> escreveu:

As you're probably aware, it's only safe to drop a table if (a) none of its fields are used, (b) the join is to a unique key of that table, (c) there is a referential constraint to that table. Without conditions (b) and (c), the table could be converting each row on the left to zero or more than one rows, and therefore removing the join would affect the result.

This kind of schema is common in data warehousing environments, so I created the Lattice construct. Tables in a lattice implicitly have many-to-one relationships. That makes it safe to transform a query on, say, (Sales JOIN Customer) to use a materialized view on ((Sales JOIN Customer) JOIN Product), because adding/removing the Product table will not affect the row count.

With Lattices the referential constraints are implicit, but we didn't have the explicit referential constraints that made it safe to do this transform. Jesus added them (class RelReferentialConstraint) as part of https://issues.apache.org/jira/browse/CALCITE-1731. Maybe you can leverage them in a rule you write. It would match (Project (Join A B)) and convert to (Project A).

I don't recall how to add rules. Some people use Frameworks or Planner.

Julian

On Fri, Jul 21, 2017 at 12:42 PM, Luis Fernando Kauer <lfka...@yahoo.com.br.invalid> wrote:
> Hi,
>
> I have a schema with a lot of normalized tables, which makes it hard for
> users to build queries.
>
> I would like to create views with all the joins already included, so that
> users could just select the fields without worrying about multiple tables and
> join conditions.
>
> But the optimizer should drop all the tables that are not actually used (when
> possible), otherwise the performance is not good.
>
> Any suggestion?
>
> How can I register my own rules besides the built-in rules when connecting
> through CalciteConnection?
>
> Best regards,
>
> Luis Fernando
[jira] [Created] (CALCITE-1906) JdbcSortRule has a bug and it is chosen
Luis Fernando Kauer created CALCITE-1906:

Summary: JdbcSortRule has a bug and it is chosen
Key: CALCITE-1906
URL: https://issues.apache.org/jira/browse/CALCITE-1906
Project: Calcite
Issue Type: Bug
Components: jdbc-adapter
Reporter: Luis Fernando Kauer
Assignee: Julian Hyde

JdbcSortRule tries to push sort and limit operations down to the database. Currently offset and limit operations are explicitly not pushed to the database (prevented by the rule), but even plain sort operations end up not being pushed. Checking how other adapters deal with this, such as the Mongo and Cassandra adapters, I realized that the convert function in JdbcSortRule is different from the others.

jdbc-adapter:
{code}
if (sort.offset != null || sort.fetch != null) {
  // Cannot implement "OFFSET n FETCH n" currently.
  return null;
}
final RelTraitSet traitSet = sort.getTraitSet().replace(out);
return new JdbcSort(rel.getCluster(), traitSet,
    convert(sort.getInput(), traitSet), sort.getCollation());
{code}

mongodb-adapter:
{code}
final RelTraitSet traitSet = sort.getTraitSet().replace(out)
    .replace(sort.getCollation());
return new MongoSort(rel.getCluster(), traitSet,
    convert(sort.getInput(), traitSet.replace(RelCollations.EMPTY)),
    sort.getCollation(), sort.offset, sort.fetch);
{code}

Fixing JdbcSortRule so that it is just like the others, and removing the code that prevented the rule from matching when limit or offset are used, seems to solve the problem: JdbcSortRule is now being applied, and both sort and limit are pushed to the database.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
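A sketch of the convert method after such a fix, following the mongodb-adapter pattern quoted above. Note that it assumes a hypothetical JdbcSort constructor extended to accept offset and fetch, which the jdbc-adapter snippet above does not yet have; this is not the committed patch:

{code}
// Sketch: JdbcSortRule.convert rewritten in the style of MongoSortRule.
// "out" is the ConverterRule's output convention field.
public RelNode convert(RelNode rel) {
  final Sort sort = (Sort) rel;
  final RelTraitSet traitSet = sort.getTraitSet().replace(out)
      .replace(sort.getCollation());
  // Strip the collation from the input's traits and carry collation,
  // offset and fetch on the JdbcSort itself.
  return new JdbcSort(rel.getCluster(), traitSet,
      convert(sort.getInput(), traitSet.replace(RelCollations.EMPTY)),
      sort.getCollation(), sort.offset, sort.fetch);
}
{code}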
Rule to remove right part of a left join when no project is used after the join
Hi,

I have a schema with a lot of normalized tables, which makes it hard for users to build queries.

I would like to create views with all the joins already included, so that users could just select the fields without worrying about multiple tables and join conditions.

But the optimizer should drop all the tables that are not actually used (when possible), otherwise the performance is not good.

Any suggestion?

How can I register my own rules besides the built-in rules when connecting through CalciteConnection?

Best regards,

Luis Fernando
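For reference, a minimal sketch of one way to register a custom rule (such as the LeftJoinRemoveRule discussed in the replies above) through the Frameworks API; the wrapper method and schema setup are illustrative assumptions, and note that ruleSets() defines the planner's rule set outright rather than appending to the defaults:

import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;
import org.apache.calcite.tools.RuleSets;

// Build a Planner whose rule set includes the custom rule. Since this list
// replaces the default rules, combine it with the standard rules you still
// need before using it in planner.transform(...).
public static Planner plannerWithCustomRule() {
  SchemaPlus rootSchema = Frameworks.createRootSchema(true);
  FrameworkConfig config = Frameworks.newConfigBuilder()
      .defaultSchema(rootSchema)
      .ruleSets(RuleSets.ofList(LeftJoinRemoveRule.INSTANCE))
      .build();
  return Frameworks.getPlanner(config);
}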
Re: Explain Plan for aggregating a single column in CSV Adapter
Hi,

If I change CsvTranslatableTable so that it implements ProjectableFilterableTable instead of TranslatableTable and implement the scan method, Calcite's own rules apply and the plan comes out right, scanning only the field used in the aggregate function. However, now I realized that "select count(*) from EMPS" generates the plan:

EnumerableAggregate(group=[{}], EXPR$0=[COUNT()])
  CsvTableScan(table=[[SALES, EMPS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

"select * from EMPS" generates the plan:

CsvTableScan(table=[[SALES, EMPS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

Notice that count(*) generates a plan that scans all fields, requiring them all to be converted without need. Even when using ProjectableFilterableTable the plan scans all fields, while the plan for "select count(name) from EMPS" scans just one field. What would be the best approach to handle count(*) without having to scan all fields?

Best regards,

Luis Fernando

Em Quinta-feira, 6 de Julho de 2017 18:05, Julian Hyde <jh...@apache.org> escreveu:

Calcite should realize that Aggregate has an implied Project (because it only uses a few columns) and push that projection into the CsvTableScan, but it doesn't. I think we need a new rule for Aggregate on a TableScan of a ProjectableFilterableTable. Can you create a JIRA case please? I created a test case. It currently fails:

diff --git a/example/csv/src/test/java/org/apache/calcite/test/CsvTest.java b/example/csv/src/test/java/org/apache/calcite/test/CsvTest.java
index 00c59ee..2402872 100644
--- a/example/csv/src/test/java/org/apache/calcite/test/CsvTest.java
+++ b/example/csv/src/test/java/org/apache/calcite/test/CsvTest.java
@@ -241,6 +241,13 @@ public Void apply(ResultSet resultSet) {
         .ok();
   }

+  @Test public void testAggregateImpliesProject() throws SQLException {
+    final String sql = "select max(name) from EMPS";
+    final String plan = "PLAN=EnumerableAggregate(group=[{}], EXPR$0=[MAX($0)])\n"
+        + "  CsvTableScan(table=[[SALES, EMPS]], fields=[[1]])\n";
+    sql("smart", "explain plan for " + sql).returns(plan).ok();
+  }
+
   @Test public void testFilterableSelect() throws SQLException {
     sql("filterable-model", "select name from EMPS").ok();
   }

Julian

> On Jul 6, 2017, at 1:23 PM, Luis Fernando Kauer <lfka...@yahoo.com.br.INVALID> wrote:
>
> Hi,
> I'm trying to understand the CSV adapter and how the rules are fired. The CsvProjectTableScanRule gets fired when I use CsvTranslatableTable. But I'm not understanding why I'm getting a plan that scans all fields when I use an aggregate function. For example:
>
> explain plan for select name from emps;
> CsvTableScan(table=[[SALES, EMPS]], fields=[[1]])
>
> explain plan for select max(name) from emps;
> EnumerableAggregate(group=[{}], EXPR$0=[MAX($1)])
>   CsvTableScan(table=[[SALES, EMPS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>
> I noticed that the rule gets fired, and at that point it shows just 1 field being used. But the last time CsvTableScan.deriveRowType() gets called it has all the fields set, and it's not the instance created by the rule, but the first instance created with all the fields.
> Can anybody explain to me whether this is a bug or whether this is supposed to happen with aggregate functions?
> Best regards,
> Luis Fernando Kauer
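One possible direction for the count(*) case, sketched under two assumptions that are not from this thread: that a rule like the one Julian describes eventually pushes the aggregate's implied (here empty) projection into the scan, and that the table can then emit zero-width rows. The helpers countRows() and scanProjected() are hypothetical:

import java.util.List;
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rex.RexNode;

// In a ProjectableFilterableTable: an empty "projects" array would mean no
// column values are needed (e.g. "select count(*)"), so emit zero-width rows
// and only scan the file for row boundaries, skipping all field conversion.
public Enumerable<Object[]> scan(DataContext root, List<RexNode> filters, int[] projects) {
  if (projects != null && projects.length == 0) {
    final int rows = countRows(); // hypothetical: counts records without parsing columns
    final Object[][] result = new Object[rows][];
    for (int i = 0; i < rows; i++) {
      result[i] = new Object[0]; // zero-width row: the aggregate only needs the count
    }
    return Linq4j.asEnumerable(result);
  }
  return scanProjected(root, filters, projects); // hypothetical normal path
}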