+1 (non-binding)

Cheers,
Fokko
On Thu, Nov 28, 2019 at 03:47, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Yea, +1, that looks pretty reasonable to me.
>>
>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have 3 features under the PostgreSQL dialect:
>>
>> I personally think we could at least stop work on the dialect until 3.0 is released.
>>
>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>
>>> +1 with the practical proposal.
>>>
>>> To me, the major concern is that the code base becomes complicated while the PostgreSQL dialect has very limited features. I tried introducing one big flag `spark.sql.dialect` and isolating the related code in #25697 <https://github.com/apache/spark/pull/25697>, but it seems hard to keep clean. Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI mode, which can be confusing at times.
>>>
>>> Gengliang
>>>
>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:
>>>
>>>> +1
>>>>
>>>>> One particular negative effect has been that new postgresql tests add well over an hour to tests,
>>>>
>>>> Adding PostgreSQL tests improves the test coverage of Spark SQL. We should continue to do this by importing more test cases; the quality of Spark depends heavily on its test coverage. We can further parallelize the test execution to reduce the test time.
>>>>
>>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>
>>>> This should not be our current focus. Full compatibility with PostgreSQL is not achievable in the near future. We should focus on adding features that are useful to the Spark community. PostgreSQL is a good reference, but we do not need to follow it blindly. We have already closed multiple related JIRAs that tried to add PostgreSQL features that are not commonly used.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>>>
>>>>> I think it is important to distinguish between two different concepts:
>>>>>
>>>>> - Adherence to standards and their well-established implementations.
>>>>> - Enabling migrations from some product X to Spark.
>>>>>
>>>>> While these two problems are related, they are independent, and one can be achieved without the other.
>>>>>
>>>>> - The former approach doesn't imply that all features of the SQL standard (or of a specific implementation) are provided. It is sufficient that the commonly used features that are implemented are standard-compliant. Therefore, if an end user applies some well-known pattern, things will work as expected.
>>>>>
>>>>> In my personal opinion that's something that is worth the required development resources, and in general should happen within the project.
>>>>>
>>>>> - The latter one is more complicated. First of all, the premise that one can "migrate PostgreSQL workloads to Spark" seems flawed. While both Spark and PostgreSQL evolve, and probably have more in common today than a few years ago, they're not even close enough to pretend that one can be a replacement for the other.
>>>>> In contrast, existing compatibility layers between major vendors make sense because the feature disparity (at least when it comes to core functionality) is usually minimal. And that doesn't even touch the problem that PostgreSQL provides extensively used extension points that enable a broad and evolving ecosystem (what should we do about continuous queries? Should Structured Streaming provide some compatibility layer as well?).
>>>>>
>>>>> More realistically, Spark could provide a compatibility layer with some analytical tools that themselves provide some PostgreSQL compatibility, but these are not always fully compatible with upstream PostgreSQL, nor do they necessarily follow the latest PostgreSQL development.
>>>>>
>>>>> Furthermore, a compatibility layer can, within certain limits (i.e. the availability of the required primitives), be maintained as a separate project, without putting more strain on existing resources. Effectively, what we care about here is whether we can translate a certain SQL string into a logical or physical plan.
>>>>>
>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>>
>>>>> This has gone very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL. Many thanks to all the people who have contributed to it!
>>>>>
>>>>> There are several cases when adding a PostgreSQL feature:
>>>>> 1. Spark doesn't have this feature: just add it.
>>>>> 2. Spark has this feature, but the behavior is different:
>>>>> 2.1 Spark's behavior doesn't make sense: change it to follow the SQL standard and PostgreSQL, with a legacy config to restore the old behavior.
>>>>> 2.2 Spark's behavior makes sense but violates the SQL standard: change the behavior to follow the SQL standard and PostgreSQL when ANSI mode is enabled (default false).
>>>>> 2.3 Spark's behavior makes sense and doesn't violate the SQL standard: add the PostgreSQL behavior under the PostgreSQL dialect (the default is the Spark native dialect).
>>>>>
>>>>> The PostgreSQL dialect itself is a good idea. It can help users migrate PostgreSQL workloads to Spark. Other databases have this strategy too. For example, DB2 provides an Oracle dialect <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>>>
>>>>> However, there are so many differences between Spark and PostgreSQL, including SQL parsing, type coercion, function/operator behavior, data types, etc., that I'm afraid we may spend a lot of effort on it, and make the Spark codebase pretty complicated, yet still not be able to provide a usable PostgreSQL dialect.
>>>>>
>>>>> Furthermore, it's not clear to me how many users have the requirement of migrating PostgreSQL workloads. I think it's much more important to make Spark ANSI-compliant first, which doesn't need that much work.
>>>>>
>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions while our own cast function is not ANSI-compliant yet. This makes me think that we should do something to properly prioritize ANSI mode over other dialects.
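As a point of reference for the ANSI mode discussed above, a minimal spark-shell sketch, assuming the `spark.sql.ansi.enabled` flag that Spark 3.0 eventually shipped (the exact config name was still in flux at the time of this thread):

    // Minimal sketch, assuming the spark.sql.ansi.enabled config of Spark 3.0+.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT CAST('abc' AS INT)").show()
    // ANSI mode: the malformed cast raises a runtime error, per the SQL standard.

    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT)").show()
    // Default (legacy) behavior: the cast silently returns NULL.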
>>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from the codebase before it's too late. Currently we only have 3 features under the PostgreSQL dialect:
>>>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, etc. are also allowed as true strings.
>>>>> 2. `date - date` returns an interval in Spark (the SQL standard behavior), but returns an int in PostgreSQL.
>>>>> 3. `int / int` returns a double in Spark, but returns an int in PostgreSQL (there is no standard here).
>>>>>
>>>>> We should still add PostgreSQL features that Spark doesn't have, or where Spark's behavior violates the SQL standard. But for the others, let's just update the answer files of the PostgreSQL tests.
>>>>>
>>>>> Any comments are welcome!
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Maciej
>>
>> --
>> ---
>> Takeshi Yamamuro
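For concreteness, the three dialect-only behaviors Wenchen lists can be checked from a spark-shell session. A minimal sketch of the Spark-native defaults, with the PostgreSQL results noted in comments (exact output formatting varies by Spark version):

    // Spark-native behavior for the three features; PostgreSQL noted for contrast.
    spark.sql("SELECT DATE '2019-11-27' - DATE '2019-11-26'").show()
    // Spark (SQL-standard): an interval of 1 day. PostgreSQL: the integer 1.

    spark.sql("SELECT 7 / 2").show()
    // Spark: 3.5 (double). PostgreSQL: 3 (integer division; no standard here).

    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()
    // PostgreSQL accepts prefixes of 'true' ('t', 'tr', 'tru', ...) as true;
    // Spark-native cast does not recognize 'tru' and yields NULL (non-ANSI).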