Thank you for sharing and confirming. We had better consider all heterogeneous customers in the world. And, I also have experiences with the non-negligible cases in on-prem.
Bests, Dongjoon. On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin <r...@databricks.com> wrote: > −User > > char barely showed up (honestly negligible). I was comparing select vs > select. > > > > On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> > wrote: > >> Ur, are you comparing the number of SELECT statement with TRIM and CREATE >> statements with `CHAR`? >> >> > I looked up our usage logs (sorry I can't share this publicly) and trim >> has at least four orders of magnitude higher usage than char. >> >> We need to discuss more about what to do. This thread is what I expected >> exactly. :) >> >> > BTW I'm not opposing us sticking to SQL standard (I'm in general for >> it). I was merely pointing out that if we deviate away from SQL standard in >> any way we are considered "wrong" or "incorrect". That argument itself is >> flawed when plenty of other popular database systems also deviate away from >> the standard on this specific behavior. >> >> Bests, >> Dongjoon. >> >> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <r...@databricks.com> wrote: >> >>> BTW I'm not opposing us sticking to SQL standard (I'm in general for >>> it). I was merely pointing out that if we deviate away from SQL standard in >>> any way we are considered "wrong" or "incorrect". That argument itself is >>> flawed when plenty of other popular database systems also deviate away from >>> the standard on this specific behavior. >>> >>> >>> >>> >>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <r...@databricks.com> >>> wrote: >>> >>>> I looked up our usage logs (sorry I can't share this publicly) and trim >>>> has at least four orders of magnitude higher usage than char. >>>> >>>> >>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <dongjoon.h...@gmail.com >>>> > wrote: >>>> >>>>> Thank you, Stephen and Reynold. >>>>> >>>>> To Reynold. >>>>> >>>>> The way I see the following is a little different. >>>>> >>>>> > CHAR is an undocumented data type without clearly defined >>>>> semantics. >>>>> >>>>> Let me describe in Apache Spark User's View point. >>>>> >>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` >>>>> function) at Apache Spark 1.x without much documentation. In addition, >>>>> there still exists an effort which is trying to keep it in 3.0.0 age. >>>>> >>>>> https://issues.apache.org/jira/browse/SPARK-31088 >>>>> Add back HiveContext and createExternalTable >>>>> >>>>> Historically, we tried to make many SQL-based customer migrate their >>>>> workloads from Apache Hive into Apache Spark through `HiveContext`. >>>>> >>>>> Although Apache Spark didn't have a good document about the >>>>> inconsistent behavior among its data sources, Apache Hive has been >>>>> providing its documentation and many customers rely the behavior. >>>>> >>>>> - >>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types >>>>> >>>>> At that time, frequently in on-prem Hadoop clusters by well-known >>>>> vendors, many existing huge tables were created by Apache Hive, not Apache >>>>> Spark. And, Apache Spark is used for boosting SQL performance with its >>>>> *caching*. This was true because Apache Spark was added into the >>>>> Hadoop-vendor products later than Apache Hive. >>>>> >>>>> Until the turning point at Apache Spark 2.0, we tried to catch up more >>>>> features to be consistent at least with Hive tables in Apache Hive and >>>>> Apache Spark because two SQL engines share the same tables. >>>>> >>>>> For the following, technically, while Apache Hive doesn't changed its >>>>> existing behavior in this part, Apache Spark evolves inevitably by moving >>>>> away from the original Apache Spark old behaviors one-by-one. >>>>> >>>>> > the value is already fucked up >>>>> >>>>> The following is the change log. >>>>> >>>>> - When we switched the default value of >>>>> `convertMetastoreParquet`. (at Apache Spark 1.2) >>>>> - When we switched the default value of `convertMetastoreOrc` >>>>> (at Apache Spark 2.4) >>>>> - When we switched `CREATE TABLE` itself. (Change `TEXT` table >>>>> to `PARQUET` table at Apache Spark 3.0) >>>>> >>>>> To sum up, this has been a well-known issue in the community and among >>>>> the customers. >>>>> >>>>> Bests, >>>>> Dongjoon. >>>>> >>>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <s...@infomedia.com.au> >>>>> wrote: >>>>> >>>>>> Hi there, >>>>>> >>>>>> I’m kind of new around here, but I have had experience with all of >>>>>> all the so called “big iron” databases such as Oracle, IBM DB2 and >>>>>> Microsoft SQL Server as well as Postgresql. >>>>>> >>>>>> They all support the notion of “ANSI padding” for CHAR columns - >>>>>> which means that such columns are always space padded, and they default >>>>>> to >>>>>> having this enabled (for ANSI compliance). >>>>>> >>>>>> MySQL also supports it, but it defaults to leaving it disabled for >>>>>> historical reasons not unlike what we have here. >>>>>> >>>>>> In my opinion we should push toward standards compliance where >>>>>> possible and then document where it cannot work. >>>>>> >>>>>> If users don’t like the padding on CHAR columns then they should >>>>>> change to VARCHAR - I believe that was its purpose in the first place, >>>>>> and >>>>>> it does not dictate any sort of “padding". >>>>>> >>>>>> I can see why you might “ban” the use of CHAR columns where they >>>>>> cannot be consistently supported, but VARCHAR is a different animal and I >>>>>> would expect it to work consistently everywhere. >>>>>> >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Steve C >>>>>> >>>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <dongjoon.h...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> Hi, Reynold. >>>>>> (And +Michael Armbrust) >>>>>> >>>>>> If you think so, do you think it's okay that we change the return >>>>>> value silently? Then, I'm wondering why we reverted `TRIM` functions >>>>>> then? >>>>>> >>>>>> > Are we sure "not padding" is "incorrect"? >>>>>> >>>>>> Bests, >>>>>> Dongjoon. >>>>>> >>>>>> >>>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < >>>>>> gourav.sengu...@gmail.com> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> 100% agree with Reynold. >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Gourav Sengupta >>>>>>> >>>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <r...@databricks.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Are we sure "not padding" is "incorrect"? >>>>>>>> >>>>>>>> I don't know whether ANSI SQL actually requires padding, but plenty >>>>>>>> of databases don't actually pad. >>>>>>>> >>>>>>>> >>>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html >>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https:%2F%2Fdocs.snowflake.net%2Fmanuals%2Fsql-reference%2Fdata-types-text.html%23:~:text%3DCHAR%2520%252C%2520CHARACTER%2C(1)%2520is%2520the%2520default.%26text%3DSnowflake%2520currently%2520deviates%2520from%2520common%2Cspace-padded%2520at%2520the%2520end.&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=BvnZTTPTZBAi8oGWIvJk2fC%2FYSgdvq%2BAxtOj0nVzufk%3D&reserved=0> >>>>>>>> : >>>>>>>> "Snowflake currently deviates from common CHAR semantics in that >>>>>>>> strings >>>>>>>> shorter than the maximum length are not space-padded at the end." >>>>>>>> >>>>>>>> MySQL: >>>>>>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql >>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F53528645%2Fwhy-char-dont-have-padding-in-mysql&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=3OGLht%2Fa28GcKhAGwJPXIR%2BMODiIwXGVuNuResZqwXM%3D&reserved=0> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < >>>>>>>> dongjoon.h...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi, Reynold. >>>>>>>>> >>>>>>>>> Please see the following for the context. >>>>>>>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136 >>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FSPARK-31136&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=pWQ9QhfVY4Uzyc8oIJ1QONQ0zOBAQ2DGSemyBj%2BvFeM%3D&reserved=0> >>>>>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE >>>>>>>>> TABLE syntax" >>>>>>>>> >>>>>>>>> I raised the above issue according to the new rubric, and the >>>>>>>>> banning was the proposed alternative to reduce the potential issue. >>>>>>>>> >>>>>>>>> Please give us your opinion since it's still PR. >>>>>>>>> >>>>>>>>> Bests, >>>>>>>>> Dongjoon. >>>>>>>>> >>>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <r...@databricks.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the >>>>>>>>>> hell out of both new and old users? >>>>>>>>>> >>>>>>>>>> For old users, their old code that was working for char(3) would >>>>>>>>>> now stop working. >>>>>>>>>> >>>>>>>>>> For new users, depending on whether the underlying metastore >>>>>>>>>> char(3) is either supported but different from ansi Sql (which is >>>>>>>>>> not that >>>>>>>>>> big of a deal if we explain it) or not supported. >>>>>>>>>> >>>>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < >>>>>>>>>> dongjoon.h...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, All. >>>>>>>>>>> >>>>>>>>>>> Apache Spark has been suffered from a known consistency issue on >>>>>>>>>>> `CHAR` type behavior among its usages and configurations. However, >>>>>>>>>>> the >>>>>>>>>>> evolution direction has been gradually moving forward to be >>>>>>>>>>> consistent >>>>>>>>>>> inside Apache Spark because we don't have `CHAR` offically. The >>>>>>>>>>> following >>>>>>>>>>> is the summary. >>>>>>>>>>> >>>>>>>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different >>>>>>>>>>> result. >>>>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a >>>>>>>>>>> fallback to Hive behavior.) >>>>>>>>>>> >>>>>>>>>>> spark-sql> CREATE TABLE t1(a CHAR(3)); >>>>>>>>>>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC; >>>>>>>>>>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET; >>>>>>>>>>> >>>>>>>>>>> spark-sql> INSERT INTO TABLE t1 SELECT 'a '; >>>>>>>>>>> spark-sql> INSERT INTO TABLE t2 SELECT 'a '; >>>>>>>>>>> spark-sql> INSERT INTO TABLE t3 SELECT 'a '; >>>>>>>>>>> >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t1; >>>>>>>>>>> a 3 >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t2; >>>>>>>>>>> a 3 >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t3; >>>>>>>>>>> a 2 >>>>>>>>>>> >>>>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent. >>>>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback >>>>>>>>>>> to Hive behavior.) >>>>>>>>>>> >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t1; >>>>>>>>>>> a 3 >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t2; >>>>>>>>>>> a 2 >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t3; >>>>>>>>>>> a 2 >>>>>>>>>>> >>>>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` >>>>>>>>>>> clause) became consistent. >>>>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` >>>>>>>>>>> provides a fallback to Hive behavior.) >>>>>>>>>>> >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t1; >>>>>>>>>>> a 2 >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t2; >>>>>>>>>>> a 2 >>>>>>>>>>> spark-sql> SELECT a, length(a) FROM t3; >>>>>>>>>>> a 2 >>>>>>>>>>> >>>>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` >>>>>>>>>>> type in the following syntax to be safe. >>>>>>>>>>> >>>>>>>>>>> CREATE TABLE t(a CHAR(3)); >>>>>>>>>>> https://github.com/apache/spark/pull/27902 >>>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fpull%2F27902&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=lhwUP5TcTtaO%2BLUTmx%2BPTjT0ASXPrQ7oKLL0N6EG0Ug%3D&reserved=0> >>>>>>>>>>> >>>>>>>>>>> This email is sent out to inform you based on the new policy we >>>>>>>>>>> voted. >>>>>>>>>>> The recommendation is always using Apache Spark's native type >>>>>>>>>>> `String`. >>>>>>>>>>> >>>>>>>>>>> Bests, >>>>>>>>>>> Dongjoon. >>>>>>>>>>> >>>>>>>>>>> References: >>>>>>>>>>> 1. "CHAR implementation?", 2017/09/15 >>>>>>>>>>> >>>>>>>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E >>>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=6hkno6zKTkcIrO%2FJo4hTYihsYvNynMuWcxhzL0fZR68%3D&reserved=0> >>>>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for >>>>>>>>>>> CREATE TABLE syntax", 2019/12/06 >>>>>>>>>>> >>>>>>>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E >>>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=QJnEU3mvUJff53Gw8F%2FAbxzd%2F8ZA1hhuoQwicX4ZXyI%3D&reserved=0> >>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> This email contains confidential information of and is the copyright >>>>>> of Infomedia. It must not be forwarded, amended or disclosed without >>>>>> consent of the sender. If you received this message by mistake, please >>>>>> advise the sender and delete all copies. Security of transmission on the >>>>>> internet cannot be guaranteed, could be infected, intercepted, or >>>>>> corrupted >>>>>> and you should ensure you have suitable antivirus protection in place. By >>>>>> sending us your or any third party personal details, you consent to (or >>>>>> confirm you have obtained consent from such third parties) to Infomedia’s >>>>>> privacy policy. http://www.infomedia.com.au/privacy-policy/ >>>>>> >>>>> >