Re: [DISCUSS] Ensuring that Calcite is consistent with other SQL systems

Bertil Chapuis Mon, 26 Feb 2024 22:28:30 -0800

Having the ability to write tests with testcontainers would greatly help at 
improving the PostGIS dialect.


Until recently, the official docker image for PostGIS lacked support for arm64. 
However, one can easily change the underlying image used by testcontainers [1]. 
I guess that spawning a Spark SQL container shouldn’t be too difficult.

As a side note, I recently opened CALCITE-6281 [2]. From what I understand, it 
is currently not possible to push down UDFs (such as the ST functions) to the 
database. More work in operator tables seems to be required and additional 
tests would probably help at clarifying the situation.

[1] 
https://github.com/bchapuis/calcite-postgis-tests/blob/f7d7df1dbff67d5beed2872f44f92b85ed18c18f/src/test/java/com/github/bchapuis/calcite_postgis_tests/AbstractPostgisTest.java#L38
[2] https://issues.apache.org/jira/browse/CALCITE-6271



> On 26 Feb 2024, at 23:48, Julian Hyde <jh...@apache.org> wrote:
> 
> I have logged https://issues.apache.org/jira/browse/CALCITE-6281.
> 
> On Mon, Feb 26, 2024 at 12:06 PM Hanumath Maduri
> <hanumathmad...@gmail.com> wrote:
>> 
>> It seems to have support for PostGIS, but I didn't test it though.
>> 
>> https://github.com/testcontainers/testcontainers-java/blob/main/modules/postgresql/src/main/java/org/testcontainers/containers/PostgisContainerProvider.java
>> https://testcontainers.com/modules/postgis/
>> 
>> Thanks
>> Hanu
>> 
>> On Mon, Feb 26, 2024 at 11:20 AM Julian Hyde <jh...@apache.org> wrote:
>> 
>>> One more thing. The geospatial library, including Bertil's proposed
>>> PostGIS dialect [5], would also benefit from this effort. We would
>>> want to ensure that "ST_Contains(ST_Point(0.0, 0.0), ST_Point(0.0,
>>> 0.0))" returns the same value on PostGIS as Calcite. Does
>>> testContainers support PostGIS?
>>> 
>>> Julian
>>> 
>>> [5] https://issues.apache.org/jira/browse/CALCITE-6239
>>> 
>>> On Mon, Feb 26, 2024 at 11:07 AM Julian Hyde <jh...@apache.org> wrote:
>>>> 
>>>> Thank you, everyone for your thoughts so far. (And let's keep talking!)
>>>> 
>>>> I don't yet see a full, perfect solution. But I see a number of
>>>> techniques/tools we can use. For example, I was not aware of
>>>> testContainers but it seems to very quickly address our most common
>>>> use cases (MySQL + Postgres). Also, while I like Quidem it probably
>>>> should not be our first solution to this problem.
>>>> 
>>>> I have two concerns about testContainers:
>>>> 1. it makes our test suite heavier: it adds latency, memory
>>>> requirements, flakiness, and the requirement for Docker;
>>>> 2. it doesn't solve other important dialects, such as BigQuery.
>>>> 
>>>> To solve (1), we could add the tests to an optional module, such as
>>>> 'plus'. To solve (2) we need to provide another way to get a
>>>> connection to a reference database. (We need that for the other huge
>>>> project, the converse of this project, to test whether Calcite's
>>>> generated SQL is compliant [1].)
>>>> 
>>>> I think we should start by merging Stamatis' PR into the 'plus'
>>>> module, and add some tests for CONCAT, which we have already
>>>> implemented, and has different behaviors in MySQL, Postgres, Oracle
>>>> [2][3]. Let's enable that test in CI and make sure that it doesn't
>>>> introduce flakiness.
>>>> 
>>>> Next, we could make that test invoke tests that are already defined in
>>>> SqlOperatorTest, in addition to/instead of, the CSV data in Stamatis'
>>>> PR.
>>>> 
>>>> Next, devise a way to test against Spark SQL and BigQuery, and check
>>>> our implementation of SOUNDEX (different in Spark than
>>>> Postgres/Oracle) and SUBSTR/SUBSTRING (different on Postgres/BigQuery)
>>>> [4].
>>>> 
>>>> After that, Quidem, somehow.
>>>> 
>>>> Julian
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/CALCITE-5529
>>>> [2] https://issues.apache.org/jira/browse/CALCITE-5771
>>>> [3] https://issues.apache.org/jira/browse/CALCITE-5745
>>>> [4] https://issues.apache.org/jira/browse/CALCITE-4427
>>>> 
>>>> On Mon, Feb 26, 2024 at 10:14 AM Mihai Budiu <mbu...@gmail.com> wrote:
>>>>> 
>>>>> I have already contributed an implementation of Sql Logic Test for
>>> Calcite, it is part of the "plus" project of Calcite.
>>>>> 
>>>>> SLT is completely complementary with the problem described here. SLT
>>> tests only the very core of the SQL language, which should be common to all
>>> implementations. It only uses 3 data types (integer, string, decimal), and
>>> it never calls any non-standard functions.
>>>>> 
>>>>> If you configure Calcite in a specific way, it can pass all SLT tests.
>>> But some optimization passes are unsound, and some of these are used in the
>>> default configuration (when you use Calcite through a JDBC connection).
>>> That's the reason why currently the SLT tests are not being regularly run.
>>>>> 
>>>>> In our project we took a very similar approach to the one described by
>>> Julian: when we adopt a SQL function from a dialect (we attempt to support
>>> functions from many dialects), we test it thoroughly on the original
>>> database and write all the same tests using our Calcite implementation. If
>>> something doesn't match, I file an issue with Calcite.
>>>>> 
>>>>> We test
>>>>> 
>>>>>  *
>>>>> compile-time evaluation (for constant expressions)
>>>>>  *
>>>>> run-time evaluation (including of the same expressions)
>>>>>  *
>>>>> tests that should fail at compilation time (e.g. type errors)
>>>>>  *
>>>>> tests that should fail at runtime (e.g., division by zero)
>>>>> 
>>>>> We try to take the tests from the official test suites, e.g.: Postgres
>>> regress:
>>> https://github.com/postgres/postgres/tree/bcd5b4bcbefc8b99be3678cf9c6c990a0c28f68e/src/test/regress/expected
>>>>> We have written parsers that read the expected output from the
>>> official tests so that we can validate the results (for positive tests)
>>> while reusing as much as possible the original tests without introducing
>>> bugs in the translation of the tests.
>>>>> 
>>>>> As Yiwen Wu says, in general it is impossible to execute a function
>>> for a specific dialect on a different runtime and get the same result. But
>>> this does not mean we shouldn't implement the functions precisely. In our
>>> runtime we reimplement all the functions in Rust, and we can support
>>> functions from different runtimes with no problem. But we still need the
>>> Java implementation to match the Rust implementation and with the original
>>> database implementation.
>>>>> 
>>>>> Quidem may be the best way to do it, but it needs negative tests too.
>>>>> 
>>>>> Mihai
>>>>> 
>>>>> ________________________________
>>>>> From: Stamatis Zampetakis <zabe...@gmail.com>
>>>>> Sent: Monday, February 26, 2024 5:31 AM
>>>>> To: dev@calcite.apache.org <dev@calcite.apache.org>
>>>>> Subject: Re: [DISCUSS] Ensuring that Calcite is consistent with other
>>> SQL systems
>>>>> 
>>>>> The Quidem approach definitely makes sense.
>>>>> 
>>>>> Alternatively, since we are focusing on the behavior of functions with
>>>>> certain inputs it may be simpler to go with a Java parameterized test
>>>>> backed by a CSV/JSON file ([1] outlines the general idea).
>>>>> As others mentioned, using the org.testcontainers framework can
>>>>> significantly alleviate the burden of setting up and managing
>>>>> different DBMS.
>>>>> 
>>>>> Best,
>>>>> Stamatis
>>>>> 
>>>>> [1] https://github.com/apache/calcite/pull/3704
>>>>> 
>>>>> 
>>>>> On Mon, Feb 26, 2024 at 4:01 AM Yiwen Wu <yiwenwu1...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>> I think it is difficult and impossible to make calcite function
>>> results
>>>>>> completely consistent with other SQL systems. In practice, when
>>>>>> calculations are pushed down to different data source adapters for
>>>>>> execution, there is no guarantee that the final results will be
>>> completely
>>>>>> consistent in calcite.
>>>>>> 
>>>>>> I think, for the built-in functions defined in
>>> `SqlStdOperatorTable`, we
>>>>>> can choose to follow one of the definitive SQL systems, such as
>>> Oracle. For
>>>>>> the engine-related extension functions in `SqlLibraryOperators`, we
>>> can
>>>>>> follow the behavior related to the specific engine. If an already
>>> defined
>>>>>> function behaves inconsistently with the new engine, we can add a new
>>>>>> Function to the definition, such as the `CONCAT` function.
>>>>>> 
>>>>>> At the same time, I agree it is a great suggestion to add execution
>>> tests
>>>>>> of different engines in Quidem, which is very effective in verifying
>>> the
>>>>>> engine execution behavior.
>>>>>> 
>>>>>> Cancai Cai <caic68...@gmail.com> 于2024年2月26日周一 09:59写道：
>>>>>> 
>>>>>>> Thank you very much to the calcite community for raising these
>>> questions.
>>>>>>> This is what I have been doubting. I am very sorry that this doubt
>>> has been
>>>>>>> discussed for so long.
>>>>>>> 
>>>>>>> Maybe we also need to consider another issue, that is, the
>>> database version
>>>>>>> issue. Versions like mysql and postgres are very stable, but
>>> components
>>>>>>> like spark still seem to have function bugs. exists, then how
>>> should we
>>>>>>> consider version issues?
>>>>>>> 
>>>>>>> I don't know what I can do. Recently I am sorting out some
>>> documents about
>>>>>>> the use of some functions of mysql and postgres in calcite. I
>>> don't know if
>>>>>>> this is helpful.
>>>>>>> 
>>>>>>> Best wishes,
>>>>>>> Cancai Cai
>>>>>>> 
>>>

signature.asc
Description: Message signed with OpenPGP

Re: [DISCUSS] Ensuring that Calcite is consistent with other SQL systems

Reply via email to