Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
This plan for evolving the TRIM function to be more standards-compliant sounds much better to me than the original change to just switch the order. It pushes users in the right direction and cleans up our tech debt without silently breaking existing workloads. It means that programs won't return different results when run on Spark 2.x and Spark 3.x.

One caveat:

> If we keep this situation in the 3.0.0 release (a major release), it means
> Apache Spark will be like this forever.

I think this ship has already sailed. There is nothing special about 3.0 here. If the API is in a released version of Spark, then the mistake is already made. Major releases are an opportunity to break APIs when we *have to*. We always strive to avoid breaking APIs even if they have not been in an X.0 release.
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Yes, right. We need to choose the best way case by case. The following level of `Runtime warning` also sounds reasonable to me.

> Maybe we should only log the warning once per Spark application.

Bests,
Dongjoon.

On Tue, Feb 18, 2020 at 1:13 AM Wenchen Fan wrote:
> I don't know what's the best way to deprecate an SQL function. Runtime
> warning can be annoying if it keeps coming out. Maybe we should only log
> the warning once per Spark application.
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
I don't know what's the best way to deprecate an SQL function. A runtime warning can be annoying if it keeps coming out. Maybe we should only log the warning once per Spark application.
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Thank you for the feedback, Wenchen, Maxim, and Takeshi.

1. "we would also promote the SQL standard TRIM syntax"
2. "we could output a warning with a notice about the order of parameters"
3. "it looks nice to make these (two parameters) trim functions unrecommended in future releases"

Yes, in case of reverting SPARK-28093, we had better deprecate these two-parameter SQL function invocations. This is because it is really esoteric and even inconsistent inside Spark (the Scala API also follows the PostgreSQL/Presto style, like the following):

    def trim(e: Column, trimString: String)

If we keep this situation in the 3.0.0 release (a major release), it means Apache Spark will be like this forever. And it becomes worse and worse, because it leads more users to fall into this trap.

There are a few ways to deprecate. I believe (3) can be a proper next step in case of reverting, because (2) is infeasible and (1) is considered insufficient because it's silent when we do SPARK-28093. We need a non-silent (noisy) one in this case. Technically, (3) can be done in `Analyzer.ResolveFunctions`.

1. Documentation-only: remove the example and add a migration guide.
2. Compile-time warning by annotation: this is not an option for a SQL function in a SQL string.
3. Runtime warning with a directional guide:
   (log.warn("... USE TRIM(trimStr FROM str) INSTEAD"))

What do you think about (3)?

Bests,
Dongjoon.
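Option (3), a runtime warning emitted during analysis, might look like the following toy sketch. This is not Spark's actual `Analyzer.ResolveFunctions` code; the mini-AST and function names here are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FuncCall:
    """Minimal stand-in for a resolved function-call expression node."""
    name: str
    args: list = field(default_factory=list)

def collect_trim_warnings(expr, warnings):
    """Recursively flag two-argument TRIM/LTRIM/RTRIM calls with a
    directional guide, as a stand-in for a rule in the analyzer."""
    if isinstance(expr, FuncCall):
        if expr.name.upper() in ("TRIM", "LTRIM", "RTRIM") and len(expr.args) == 2:
            warnings.append(f"`{expr.name}` with two parameters is deprecated; "
                            "USE TRIM(trimStr FROM str) INSTEAD")
        for arg in expr.args:
            collect_trim_warnings(arg, warnings)

warnings = []
# SELECT trim('yxTomxx', 'xyz') -- the esoteric two-parameter form
collect_trim_warnings(FuncCall("trim", ["yxTomxx", "xyz"]), warnings)
# Nested calls are found too: upper(ltrim('x', col))
collect_trim_warnings(FuncCall("upper", [FuncCall("ltrim", ["x", "col"])]), warnings)
print(len(warnings))  # 2
```

One-parameter calls pass through silently, matching the thread's scope: only the two-parameter form is deprecated.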
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
The revert looks fine to me for keeping the compatibility.
Also, I think the different orders between the systems easily lead to mistakes, so, as Wenchen suggested, it looks nice to make these (two-parameter) trim functions unrecommended in future releases:
https://github.com/apache/spark/pull/27540#discussion_r377682518
Actually, I think the SQL TRIM syntax is enough for trim use cases...

Bests,
Takeshi
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Also, if we look at the possible combinations of trim parameters:

1. foldable srcStr + foldable trimStr
2. foldable srcStr + non-foldable trimStr
3. non-foldable srcStr + foldable trimStr
4. non-foldable srcStr + non-foldable trimStr

Case #2 seems rare, and #3 is probably the most common case. Once we see the second case, we could output a warning with a notice about the order of parameters.

Maxim Gekk
Software Engineer
Databricks, Inc.
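Maxim's heuristic — warn when a call matches the rare case #2 — could be sketched as follows. The `Literal` and `ColumnRef` classes are invented stand-ins for foldable and non-foldable expressions, not Spark's real classes:

```python
from dataclasses import dataclass

@dataclass
class Literal:
    """Foldable: the value is known at planning time."""
    value: str

@dataclass
class ColumnRef:
    """Non-foldable: the value is only known per row."""
    name: str

def is_foldable(expr) -> bool:
    return isinstance(expr, Literal)

def likely_swapped(src_str, trim_str) -> bool:
    """Case #2 (foldable srcStr + non-foldable trimStr) is rare enough
    that it usually means the caller swapped the arguments."""
    return is_foldable(src_str) and not is_foldable(trim_str)

# Case #3, the common pattern: trim a column with a literal character set.
print(likely_swapped(ColumnRef("name"), Literal("xyz")))      # False
# Case #2, the suspicious pattern: literal source, column trim set.
print(likely_swapped(Literal("yxTomxx"), ColumnRef("chars"))) # True
```

Cases #1 and #4 give no signal either way, which is why the heuristic can only catch a subset of mistakes.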
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
It's unfortunate that we don't have a clear document about breaking changes (I'm working on it, BTW). I believe the general guidance is: *avoid breaking changes unless we have to*. For example, the previous result was so broken that we had to fix it, moving to Scala 2.12 made us break some APIs, etc.

For this particular case, do we have to switch the parameter order? It's different from some systems, and the order was not decided explicitly, but I don't think those are strong reasons. People from RDBMS backgrounds should use the SQL standard TRIM syntax more often. People using prior Spark versions should have figured out the parameter order of Spark TRIM (there was no document) and adjusted their queries. There is no standard that defines the parameter order of the TRIM function.

In the long term, we would also promote the SQL standard TRIM syntax. I don't see much benefit in "fixing" the parameter order that is worth making a breaking change.

Thanks,
Wenchen
Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`
Please note that the context is TRIM/LTRIM/RTRIM with two parameters and the TRIM(trimStr FROM str) syntax.

This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.

On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun wrote:
> Hi, All.
>
> I'm sending this email because the Apache Spark committers had better have
> a consistent point of view for the upcoming PRs. And, the community policy
> is the way to lead the community members transparently and clearly for a
> long-term good.
>
> First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache
> Spark 3.0.0 is going to achieve many improvements in SQL areas.
>
> Especially, we invested lots of effort to improve SQL ANSI compliance and
> related test coverage for the future. One simple example is the following.
>
> [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax
>
> With this support, we are able to move away from the esoteric Spark
> functions `TRIM/LTRIM/RTRIM`.
> (Note that the above syntax is the ANSI standard, but we have these
> additional non-standard functions, too.)
>
> Here is the summary of the current status.
>
> +------------------------+-------------+---------------+
> | SQL Processing Engine  | TRIM Syntax | TRIM Function |
> +------------------------+-------------+---------------+
> | Spark 3.0.0-preview/2  | O           | O             |
> | PostgreSQL             | O           | O             |
> | Presto                 | O           | O             |
> | MySQL                  | O(*)        | X             |
> | Oracle                 | O           | X             |
> | MsSQL                  | O           | X             |
> | Teradata               | O           | X             |
> | Hive                   | X           | X             |
> | Spark 1.6 ~ 2.2        | X           | X             |
> | Spark 2.3 ~ 2.4        | X           | O(**)         |
> +------------------------+-------------+---------------+
> (*) MySQL doesn't allow multiple trim characters.
> (**) Spark 2.3 ~ 2.4 have the function with a different parameter order.
>
> Here is the illustrative example of the problem.
>
> postgres=# SELECT trim('yxTomxx', 'xyz');
> btrim
> -------
> Tom
>
> presto:default> SELECT trim('yxTomxx', 'xyz');
> _col0
> -------
> Tom
>
> spark-sql> SELECT trim('yxTomxx', 'xyz');
> z
>
> Here is our history of fixing the above issue.
>
> [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
> 1. https://github.com/apache/spark/pull/24902
>    (Merged 2019-06-18 for v3.0.0; 3.0.0-preview and 3.0.0-preview2 released.)
> 2. https://github.com/apache/spark/pull/24907
>    (Merged 2019-06-20 for v2.3.4, but reverted)
> 3. https://github.com/apache/spark/pull/24908
>    (Merged 2019-06-21 for v2.4.4, but reverted)
>
> (2) and (3) were reverted before the releases because we didn't want to fix
> that in the maintenance releases. Please see the following references for
> the decision.
>
> https://github.com/apache/spark/pull/24908#issuecomment-504799028 (2.3)
> https://github.com/apache/spark/pull/24907#issuecomment-504799021 (2.4)
>
> Now, there are some requests to revert SPARK-28093 and to keep these
> esoteric functions for backward compatibility, for the following reasons.
>
> > Reordering function parameters to match another system,
> > for a method that is otherwise working correctly,
> > sounds exactly like a cosmetic change to me.
>
> > How can we silently change the parameter of an existing SQL function?
> > I don't think this is a correctness issue as the SQL standard
> > doesn't say that the function signature has to be trim(srcStr, trimStr).
>
> The concerns and points of view make sense.
>
> My concerns are the following.
>
> 1. These kinds of esoteric differences are `vendor lock-in` in a negative way.
>    - It's difficult for new users to understand.
>    - It's hard to migrate between Apache Spark and the others.
> 2. Although we did our best, Apache Spark SQL has not always been enough.
> 3. We need to make improvements in the future releases.
>
> In short, we can keep the 3.0.0-preview behavior here, or revert SPARK-28093
> in order to keep this vendor lock-in behavior for backward compatibility.
>
> What I want is to build a consistent point of view in this category for
> the upcoming PR reviews.
>
> If we have a clear policy, we can save future community efforts in many
> ways.
>
> Bests,
> Dongjoon.
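The PostgreSQL/Presto vs. Spark 2.x discrepancy in the example above can be reproduced with Python's `str.strip`, which removes any of a set of characters from both ends of a string. This is a hedged illustration; the function names are mine, not Spark's:

```python
def trim_standard(src_str: str, trim_str: str) -> str:
    """PostgreSQL / Presto / Spark 3.0.0-preview order: trim(srcStr, trimStr)."""
    return src_str.strip(trim_str)

def trim_spark2(trim_str: str, src_str: str) -> str:
    """Spark 2.3 ~ 2.4 order: trim(trimStr, srcStr)."""
    return src_str.strip(trim_str)

# The same call text gives different results depending on the parameter order:
print(trim_standard('yxTomxx', 'xyz'))  # Tom  (as in postgres= / presto:default>)
print(trim_spark2('yxTomxx', 'xyz'))    # z    (as in spark-sql 2.3 ~ 2.4)
```

In the swapped interpretation, 'yxTomxx' becomes the trim character set and 'xyz' the source, so only the leading 'x' and 'y' are stripped, leaving 'z' — exactly the surprising result shown in the thread.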