Hi mbudiu,thank you very much for your patient reply. I have learned a lot of information from your reply. Best, LakeShen
<[email protected]> 于2023年7月30日周日 01:20写道: > Hello LakeShen, > > I didn't say that we are implementing the Postgres SQL semantics using > Calcite, I just said that we are trying to reuse Postgres tests for testing > Calcite. We do end up changing or ignoring a significant fraction of these > tests in the process (I would say that more than half of the tests simply > cannot run). We have still a long way to go, but from my experience I would > say that supporting the exact Postgres semantics on top of Calcite will be > a heroic effort. You can partly blame the SQL standard for leaving many > behaviors underspecified. > > I can't make an exhaustive list, but let me give you a few examples which > I encountered so far, which show why I think this is very difficult. > > * First, Calcite is built on a specific family of types. Calcite makes > some pretty deep assumptions about the range of values of such types. > * Postgres NUMERIC has Infinity values, whereas Calcite doesn't. > * The Postgres INTERVAL type can represent values that are a mix > of "long" intervals (days) and "short" intervals (seconds), whereas I think > that the SQL standard considers these as being different types. Calcite > does too. > * Calcite DATES do not support negative dates or dates before BC. > * Second, the Calcite grammar makes some choices in the representation of > literals. It's true that the grammar is extensible, but I don't know if it > is flexible enough to accommodate all changes needed to parse Postgres. > * The representation of Unicode characters in Postgres (even for > identifiers) is broader, e.g., 6-digit hex sequences like \\+000061 > * NUMERIC Postgres types need to parse the new 'Infinity' values. > * Or, let's consider just INTERVAL literals in Postgres > https://www.postgresql.org/docs/15/datatype-datetime.html#DATATYPE-INTERVAL-INPUT. > Postgres allows a much wider range of literals, in two ways: > * you can change the literal representation that is > accepted by using a SET SQL statement: e.g., "SET IntervalStyle to > postgres" or to "sql_standard". I don't think there isn't really any > equivalent mechanism in Calcite to configure the parser at runtime. > * In Postgres an interval literal can have a sign on each > of its components, whereas in Calcite you can only have one sign upfront > * A literal for an Illegal date, such as February 29 2023 > will be evaluated as NULL by Calcite but will produce an error in Postgres > * Postgres supports more types of timeUnits, such as 'Julian' or > 'Epoch' > * Third, Calcite embeds already a semantics for SQL in many of its > optimization rules. You can avoid this problem only by not using these > optimization rules, but then many optimization opportunities will be wasted. > * For example, the CoreRules.*REDUCE* rules perform expression > evaluation at compilation time. If these rules don't produce the same > results as Postgres would do at runtime you cannot use these rules. > * The typing and semantics of functions in Calcite is > customizable, but not necessarily of built-in SQL constructs. > * The semantics of string concatenation in Postgres converts all > strings to VARCHAR, whereas in Calcite it keeps the strings unchanged. As a > consequence the treatment of trailing spaces is different. > * EXTRACT(DOW FROM DATE) in Calcite returns 1 for Sunday but in > Postgres it returns 0. > * Then there's the matter of the DDL. These problem is solvable by > extending the server parser. > * the Calcite server DDL does not accept Postgres constraints, > e.g., constructs such as PRIMARY KEY which is specified essentially as an > annotation on the type of a column > * Postgres seems to support specifying COLLATION for each column > separately > > I am sure there are many more problems that we haven't hit yet. > > It really depends on your goals. I am skeptical you can achieve 100% > compatibility with Postgres. But if all you want is to be able to handle > the intersection of Postgres SQL and Calcite SQL in a similar way (i.e., > have the programs that are valid across both dialects produce the same > results), there is hope. > > Mihai > > -----Original Message----- > From: LakeShen > Sent: Friday, July 28, 2023 7:04 PM > To: [email protected] > Subject: Re: Easier and more comprehensive testing > > Hi mbudiu,thank you for bringing SqlLogicTest to calcite. > Now we're working on an optimizer based on Calcite that supports > Postgresql semantics as a whole. I am very interested in your bringing the > Postgres test to Calcite and would like to participate in it. Could you > please give me a complete collection of reference materials? > The current document information is scattered, some in sql-logic-test > github and some in sql-to-dbsp-compiler github, which requires everyone to > read on github by themselves. If you could give me a complete end-to-end > demo of adding tests, I would really appreciate it. Best, LakeShen > > <[email protected]> 于2023年7月29日周六 01:04写道: > > > The correct link is > > https://github.com/feldera/dbsp/tree/main/sql-to-dbsp-compiler, but > > it's not particularly important, I apologize for the broken one. > > > > I appreciate you pointing out the Sql Logic Test project, I wrote that > > code too. But these two projects are almost entirely disjoint. SLT > > brings a few million pre-written tests. But none of the SLT tests > > exercises any of the SQL functions. So you need to write many more > > tests for each *function* implemented, and most contributions to > > Calcite lately are functions in different dialects. > > > > Moreover, while SLT found quite a few bugs, I only had time to file > > one of them so far. For each failing test I have to figure out (1) > > whether it's a duplicate, (2) to create a minimal reproduction, (3) to > > figure out which part of the compiler is faulty in order to understand > > where to insert the test, (4) create a reproduction tailored for the > faulty component. > > Reproductions for planner bugs are different than reproductions for > > library bugs. But some bugs happen only if you combine some planner > > rules with some functions. > > > > With the proposal below you only need a CLI to a database to generate > > new tests (e.g., BigQuery): you execute write the queries in the CLI, > > then copy-paste the output into a test. This makes it easier to file > > bugs, but harder to diagnose them. But it also makes it much easier to > > write lots of tests, because it enables people without Calcite > > expertise to write the tests. > > > > Mihai > > > > -----Original Message----- > > From: stanilovsky evgeny > > Sent: Friday, July 28, 2023 2:43 AM > > To: [email protected] > > Subject: Re: Easier and more comprehensive testing > > > > Hello, your github link doesn`t open. > > plz check discussion here in dev list titled: > > > > Running Sql Logic Tests for Calcite > > This is the JIRA case: > > https://issues.apache.org/jira/browse/CALCITE-5615 > > And this is the PR: https://github.com/apache/calcite/pull/3145 > > > > > > > Hello, > > > > > > > > > I am working to test our calcite-based compiler > > > (https://github.com/feldera/dbsp/sql-to-dbsp-compiler), and since I > > > am lazy I am borrowing tests from other open-source test suites, > > > like Postgres. > > > > > > I am finding bugs in Calcite with a relatively high frequency, as > > > you may have noticed if you follow the JIRA. I would say that one in > > > 3 SQL functions I test turns out to have some problems. So our > > > techniques seem to be effective at finding bugs. > > > > > > I think that some of the techniques we are using could be applied to > > > Calcite as well: > > > > > > > > > * First, we run all tests with and without the optimizer. In > > > particular, we go through all the "constant evaluation" rules of > Calcite. > > > For constant expressions the results should be identical with and > > > without optimizer. They aren't always, and we have found quite a few > > > cases where the compile-time evaluation crashes or produces wrong > > > results. So the compile-time and the run-time evaluator cross-check > > > each other. > > > * Second, we have tried to make it very easy to write end-to-end > > > tests, at least positive tests (which are supposed to return a value > > > rather than an error). Here is a stylized excerpt from our testing > > > code: > > > > > > > > > @Override > > > > > > public void prepareData() { > > > > > > this.executeStatements("CREATE TABLE FLOAT4_TBL (f1 > > float4);\n" > > > + > > > > > > "INSERT INTO FLOAT4_TBL(f1) VALUES (' 0.0');\n" + > > > > > > "INSERT INTO FLOAT4_TBL(f1) VALUES ('1004.30 ');\n" + > > > > > > "INSERT INTO FLOAT4_TBL(f1) VALUES (' -34.84 > > > ');\n" + > > > > > > "INSERT INTO FLOAT4_TBL(f1) VALUES > > > ('1.2345678901234e+20');\n" + > > > > > > "INSERT INTO FLOAT4_TBL(f1) VALUES > > > ('1.2345678901234e-20');"); > > > > > > } > > > > > > > > > @Test > > > > > > public void testFPArithmetic() { > > > > > > this.qs("SELECT f.f1, f.f1 * '-10' AS x FROM FLOAT4_TBL f\n" > > > + > > > > > > " WHERE f.f1 > '0.0';\n" + > > > > > > " f1 | x \n" + > > > > > > "---------------+----------------\n" + > > > > > > " 1004.3 | -10043\n" + > > > > > > " 1.2345679e+20 | -1.2345678e+21\n" + > > > > > > " 1.2345679e-20 | -1.2345678e-19\n" + > > > > > > "(3 rows)\n" + > > > > > > "\n" + > > > > > > "SELECT f.f1, f.f1 + '-10' AS x FROM FLOAT4_TBL f\n" > > > + > > > > > > " WHERE f.f1 > '0.0';\n" + > > > > > > " f1 | x \n" + > > > > > > "---------------+---------------\n" + > > > > > > " 1004.3 | 994.3\n" + > > > > > > " 1.2345679e+20 | 1.2345679e+20\n" + > > > > > > " 1.2345679e-20 | -10\n" + > > > > > > "(3 rows)\n" + > > > > > > "\n"); > > > > > > } > > > > > > > > > These long strings are an almost direct copy-and-paste from tests in > > > https://github.com/postgres/postgres/blob/03734a7fed7d924679770adb78 > > > a7 > > > db8a37 d14188/src/test/regress/expected/float4.out (but they have to > > > manually checked, since Calcite does often not match Postgres > > > behaviors). A string can contain multiple query-expected result > > > strings. > > > > > > > > > The point is that the barrier for writing the tests is quite low. > > > > > > Granted, this approach has some weaknesses as well, in particular, > > > we rely on a specific output format and is brittle in some respects > > > (e.g., spaces in output strings). > > > > > > > > > I have seen some *.iq files as resources in the Calcite source code, > > > are these used in the same way for testing in Calcite? If yes, > > > that's great, and there should be many more. > > > > > > > > > Calcite has lots of unit tests, but I find that our "end-to-end" > > > tests are easier to write and have much better code coverage. For > > > example, for each failure I find in Calcite I have to dig pretty > > > hard to figure out which of the testing files should contain my > > > reproduction (e.g., SqlOperatorTest, RelOptRulesTest, > > > RelToSqlConverterTest, etc.), and how to exactly write the > > > reproduction (some reproductions require editing some huge XML files > too). > > > That takes a lot of time. > > > > > > > > > I am not necessarily signing up to build this infrastructure, at > > > least not right away. I am not sure how much of the stuff I wrote > > > for our compiler could be ported directly to Calcite. But I will > > > think about it, and I will gladly help this effort. > > > > > > > > > I appreciate any comments and suggestions, > > > > > > Mihai > > > > > >
