+1

Bests,
Dongjoon.
On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Yea, +1, that looks pretty reasonable to me.
>
> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> > from the codebase before it's too late. Currently we only have 3
> > features under the PostgreSQL dialect.
>
> I personally think we could at least stop work on the dialect until 3.0
> is released.
>
>
> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>
>> +1 with the practical proposal.
>> To me, the major concern is that the code base becomes complicated,
>> while the PostgreSQL dialect has very limited features. I tried
>> introducing one big flag `spark.sql.dialect` and isolating the related
>> code in #25697 <https://github.com/apache/spark/pull/25697>, but it
>> seems hard to keep it clean.
>> Furthermore, the PostgreSQL dialect configuration overlaps with the
>> ANSI mode, which can be confusing sometimes.
>>
>> Gengliang
>>
>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lix...@databricks.com> wrote:
>>
>>> +1
>>>
>>>> One particular negative effect has been that new postgresql tests add
>>>> well over an hour to tests,
>>>
>>> Adding the postgresql tests is for improving the test coverage of
>>> Spark SQL. We should continue to do this by importing more test cases.
>>> The quality of Spark highly depends on the test coverage. We can
>>> further parallelize the test execution to reduce the test time.
>>>
>>>> Migrating PostgreSQL workloads to Spark SQL
>>>
>>> This should not be our current focus. In the near future, it is
>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>> adding features that are useful to the Spark community. PostgreSQL is
>>> a good reference, but we do not need to blindly follow it. We have
>>> already closed multiple related JIRAs that tried to add PostgreSQL
>>> features that are not commonly used.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>>
>>>> I think it is important to distinguish between two different concepts:
>>>>
>>>> - Adherence to standards and their well-established implementations.
>>>> - Enabling migrations from some product X to Spark.
>>>>
>>>> While these two problems are related, they are independent, and one
>>>> can be achieved without the other.
>>>>
>>>> - The former doesn't imply that all features of the SQL standard (or
>>>> of a specific implementation of it) are provided. It is sufficient
>>>> that the commonly used features that are implemented are standard
>>>> compliant. Therefore, if an end user applies some well-known pattern,
>>>> things will work as expected.
>>>>
>>>> In my personal opinion that's something that is worth the required
>>>> development resources, and in general it should happen within the
>>>> project.
>>>>
>>>> - The latter is more complicated. First of all, the premise that one
>>>> can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>>>> both Spark and PostgreSQL evolve, and probably have more in common
>>>> today than a few years ago, they're not even close enough to pretend
>>>> that one can be a replacement for the other. In contrast, the existing
>>>> compatibility layers between major vendors make sense, because the
>>>> feature disparity (at least when it comes to core functionality) is
>>>> usually minimal. And that doesn't even touch the problem that
>>>> PostgreSQL provides extensively used extension points that enable a
>>>> broad and evolving ecosystem (what should we do about continuous
>>>> queries? Should Structured Streaming provide some compatibility layer
>>>> as well?).
>>>>
>>>> More realistically, Spark could provide a compatibility layer with
>>>> some analytical tools that themselves provide some PostgreSQL
>>>> compatibility, but these are not always fully compatible with upstream
>>>> PostgreSQL, nor do they necessarily follow the latest PostgreSQL
>>>> development.
>>>>
>>>> Furthermore, a compatibility layer can be, within certain limits
>>>> (i.e. the availability of the required primitives), maintained as a
>>>> separate project, without putting more strain on existing resources.
>>>> Effectively, what we care about here is whether we can translate a
>>>> given SQL string into a logical or physical plan.
>>>>
>>>>
>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Recently we started an effort to achieve feature parity between Spark
>>>> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>
>>>> This is going very well. We've added many missing features (parser
>>>> rules, built-in functions, etc.) to Spark, and also corrected several
>>>> inappropriate behaviors of Spark to follow the SQL standard and
>>>> PostgreSQL. Many thanks to all the people who have contributed to it!
>>>>
>>>> There are several cases when adding a PostgreSQL feature:
>>>> 1. Spark doesn't have this feature: just add it.
>>>> 2. Spark has this feature, but the behavior is different:
>>>>   2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>>>> standard and PostgreSQL, with a legacy config to restore the old
>>>> behavior.
>>>>   2.2 Spark's behavior makes sense but violates the SQL standard:
>>>> change the behavior to follow the SQL standard and PostgreSQL when
>>>> ANSI mode is enabled (default false).
>>>>   2.3 Spark's behavior makes sense and doesn't violate the SQL
>>>> standard: add the PostgreSQL behavior under the PostgreSQL dialect
>>>> (the default is the Spark native dialect).
>>>>
>>>> The PostgreSQL dialect itself is a good idea. It can help users
>>>> migrate PostgreSQL workloads to Spark. Other databases have this
>>>> strategy too. For example, DB2 provides an Oracle dialect
>>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>>
>>>> However, there are so many differences between Spark and PostgreSQL,
>>>> including SQL parsing, type coercion, function/operator behavior,
>>>> data types, etc. I'm afraid that we may spend a lot of effort on it,
>>>> and make the Spark codebase pretty complicated, but still not be able
>>>> to provide a usable PostgreSQL dialect.
>>>>
>>>> Furthermore, it's not clear to me how many users have the requirement
>>>> of migrating PostgreSQL workloads. I think it's much more important to
>>>> make Spark ANSI-compliant first, which doesn't need that much work.
>>>>
>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions,
>>>> while our own cast function is not ANSI-compliant yet. This makes me
>>>> think that we should do something to properly prioritize ANSI mode
>>>> over other dialects.
>>>>
>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>>> from the codebase before it's too late. Currently we only have 3
>>>> features under the PostgreSQL dialect:
>>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
>>>> also allowed as true strings.
>>>> 2. `date - date` returns an interval in Spark (the SQL standard
>>>> behavior), but returns an int in PostgreSQL.
>>>> 3. `int / int` returns a double in Spark, but returns an int in
>>>> PostgreSQL. (There is no standard behavior here.)
>>>>
>>>> We should still add PostgreSQL features that Spark doesn't have, or
>>>> where Spark's behavior violates the SQL standard. But for the others,
>>>> let's just update the answer files of the PostgreSQL tests.
>>>>
>>>> Any comments are welcome!
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej
>>>>
>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> <https://databricks.com/sparkaisummit/north-america>
>>
>
> --
> ---
> Takeshi Yamamuro
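For reference, below is a minimal sketch of the three PostgreSQL-dialect behaviors listed in Wenchen's mail, written as Spark SQL statements issued from a Scala spark-shell session (where `spark` is the session's SparkSession). The `spark.sql.dialect` setting shown in a comment is the flag discussed in PR #25697 and may not exist in a released Spark version; the commented behaviors simply restate the claims made in the thread, not verified output.

```scala
// Minimal sketch, assuming a spark-shell session where `spark` is a SparkSession.
// The behaviors in the comments restate the claims made in the thread; actual
// results depend on the Spark version and on whether ANSI mode or the
// PostgreSQL dialect is enabled.

// Big dialect flag discussed in PR #25697 (may not exist in released Spark):
// spark.conf.set("spark.sql.dialect", "PostgreSQL")

// 1. Casting string to boolean: under the PostgreSQL dialect, prefixes such as
//    `t`, `tr`, `tru`, `yes` are also accepted as true strings.
spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()

// 2. date - date: an interval in Spark (SQL standard behavior),
//    but an int (number of days) in PostgreSQL.
spark.sql("SELECT CAST('2019-11-27' AS DATE) - CAST('2019-11-26' AS DATE)").show()

// 3. int / int: a double in Spark, but an int in PostgreSQL.
spark.sql("SELECT 3 / 2").show()
```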