+1

> One particular negative effect has been that new postgresql tests add well
> over an hour to tests,


Adding PostgreSQL tests improves the test coverage of Spark SQL. We
should continue to do this by importing more test cases. The quality of
Spark depends heavily on its test coverage. We can further parallelize the
test execution to reduce the test time.
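As a rough illustration of the parallelization knobs (assuming a plain sbt
project; these are standard sbt settings, not Spark's actual build
configuration):

```scala
// build.sbt -- a hedged sketch, not Spark's real build definition.

// Run independent test classes in this module concurrently:
Test / parallelExecution := true

// Allow up to 4 forked test JVMs to run at the same time:
Global / concurrentRestrictions := Seq(Tags.limit(Tags.ForkedTestGroup, 4))
```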

Migrating PostgreSQL workloads to Spark SQL


This should not be our current focus. In the near future, it is impossible
to be fully compatible with PostgreSQL. We should focus on adding features
that are useful to the Spark community. PostgreSQL is a good reference, but
we do not need to follow it blindly. We have already closed multiple related
JIRAs that tried to add PostgreSQL features that are not commonly used.

Cheers,

Xiao


On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com>
wrote:

> I think it is important to distinguish between two different concepts:
>
>    - Adherence to standards and their well established implementations.
>    - Enabling migrations from some product X to Spark.
>
> While these two problems are related, they are independent and one can be
> achieved without the other.
>
>    - The former approach doesn't imply that all features of the SQL standard
>    (or of a specific implementation) are provided. It is sufficient that the
>    commonly used features that are implemented are standard compliant.
>    Therefore, if an end user applies some well-known pattern, things will
>    work as expected.
>
>    In my personal opinion, that's something worth the required
>    development resources, and in general it should happen within the project.
>
>
>    - The latter one is more complicated. First of all, the premise that
>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>    both Spark and PostgreSQL evolve, and probably have more in common today
>    than a few years ago, they're not close enough to pretend that one can
>    be a replacement for the other. In contrast, the existing compatibility
>    layers between major vendors make sense, because the feature disparity
>    (at least when it comes to core functionality) is usually minimal. And
>    that doesn't even touch the problem that PostgreSQL provides extensively
>    used extension points that enable a broad and evolving ecosystem (what
>    should we do about continuous queries? Should Structured Streaming
>    provide some compatibility layer as well?).
>
>    More realistically, Spark could provide a compatibility layer with some
>    analytical tools that themselves provide some PostgreSQL compatibility,
>    but these are not always fully compatible with upstream PostgreSQL, nor
>    do they necessarily follow the latest PostgreSQL development.
>
>    Furthermore, a compatibility layer can, within certain limits (i.e. the
>    availability of the required primitives), be maintained as a separate
>    project, without putting more strain on existing resources. Effectively,
>    what we care about here is whether we can translate a certain SQL string
>    into a logical or physical plan.
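>
>    To make that last point concrete, here is a minimal sketch of such a
>    separately maintained layer, assuming only Spark's existing
>    SparkSessionExtensions.injectParser hook (the class name and the no-op
>    rewrite below are hypothetical):
>
> ```scala
> import org.apache.spark.sql.SparkSessionExtensions
>
> // Registered via spark.sql.extensions=com.example.PgDialectExtensions,
> // so it can live entirely outside the Spark codebase.
> class PgDialectExtensions extends (SparkSessionExtensions => Unit) {
>   override def apply(ext: SparkSessionExtensions): Unit = {
>     ext.injectParser { (_, delegate) =>
>       // A real layer would translate PostgreSQL-specific SQL strings here
>       // before delegating; returning the delegate keeps this sketch a no-op.
>       delegate
>     }
>   }
> }
> ```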
>
>
> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark and
> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This is going very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
> Many thanks to all the people who have contributed to it!
>
> There are several cases when adding a PostgreSQL feature:
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
> standard and PostgreSQL, with a legacy config to restore the old behavior.
>     2.2 Spark's behavior makes sense but violates the SQL standard: change
> the behavior to follow the SQL standard and PostgreSQL when ANSI mode is
> enabled (default false); see the sketch after this list.
>     2.3 Spark's behavior makes sense and doesn't violate the SQL standard:
> add the PostgreSQL behavior under the PostgreSQL dialect (the default is the
> Spark native dialect).
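>
> To illustrate case 2.2 (a hedged example, assuming the spark.sql.ansi.enabled
> flag; run in spark-shell, where `spark` is the predefined SparkSession):
>
> ```scala
> // With ANSI mode off (the default), integer overflow silently wraps around:
> spark.sql("SELECT 2147483647 + 1").show()   // -2147483648
>
> // With ANSI mode on, the same query fails as the SQL standard requires:
> spark.conf.set("spark.sql.ansi.enabled", "true")
> spark.sql("SELECT 2147483647 + 1").show()   // throws ArithmeticException
> ```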
>
> The PostgreSQL dialect itself is a good idea. It can help users migrate
> PostgreSQL workloads to Spark. Other databases take this approach too. For
> example, DB2 provides an Oracle dialect
> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>
> .
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc., that I'm afraid we may spend a lot of effort on it and make
> the Spark codebase pretty complicated, yet still not be able to provide a
> usable PostgreSQL dialect.
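>
> For instance, even a tiny expression coerces differently (a hedged
> illustration; `spark` is the spark-shell session):
>
> ```scala
> // Spark coerces the string operand to DOUBLE, so this returns 5.0 ...
> spark.sql("SELECT '2' + 3").show()
> // ... while PostgreSQL resolves the literal as an integer and returns 5:
> //   postgres=# SELECT '2' + 3;
> ```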
>
> Furthermore, it's not clear to me how many users have a requirement to
> migrate PostgreSQL workloads. I think it's much more important to make
> Spark ANSI-compliant first, which doesn't require nearly as much work.
>
> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
> our own cast function is not ANSI-compliant yet. This makes me think that
> we should do something to properly prioritize ANSI mode over other dialects.
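>
> For example (a sketch of the current, non-ANSI cast behavior in spark-shell):
>
> ```scala
> // A malformed cast silently yields NULL today, whereas the SQL standard
> // (and PostgreSQL) require a runtime error:
> spark.sql("SELECT CAST('abc' AS INT)").show()   // NULL
> ```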
>
> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under the PostgreSQL dialect:
> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, etc. are also
> allowed as true strings.
> 2. `date - date` returns an interval in Spark (the SQL standard behavior),
> but returns an int in PostgreSQL.
> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL
> (there is no standard here).
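>
> For reference, the three behaviors in one place (a sketch in spark-shell;
> the PostgreSQL results are given in the comments):
>
> ```scala
> // 1. String-to-boolean cast: NULL natively; true under the PostgreSQL dialect
> spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()
> // 2. Date subtraction: an interval of 1 day in Spark; PostgreSQL returns 1
> spark.sql("SELECT DATE '2019-11-27' - DATE '2019-11-26'").show()
> // 3. Integer division: 1.5 in Spark; PostgreSQL returns 1
> spark.sql("SELECT 3 / 2").show()
> ```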
>
> We should still add PostgreSQL features that Spark doesn't have, or where
> Spark's behavior violates the SQL standard. But for the others, let's just
> update the answer files of the PostgreSQL tests.
>
> Any comments are welcome!
>
> Thanks,
> Wenchen
>
> --
> Best regards,
> Maciej
>
>
