
CHAR barely showed up (honestly negligible). I was comparing SELECT vs SELECT.

On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> 
> Ur, are you comparing the number of SELECT statements with TRIM and CREATE
> statements with `CHAR`?
> 
> > I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> We need to discuss further what to do. This thread is exactly what I
> expected. :)
> 
> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
> it). I was merely pointing out that if we deviate away from SQL standard
> in any way we are considered "wrong" or "incorrect". That argument itself
> is flawed when plenty of other popular database systems also deviate away
> from the standard on this specific behavior.
> 
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <r...@databricks.com> wrote:
> 
> 
>> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
>> I was merely pointing out that if we deviate away from SQL standard in any
>> way we are considered "wrong" or "incorrect". That argument itself is
>> flawed when plenty of other popular database systems also deviate away
>> from the standard on this specific behavior.
>> 
>> 
>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <r...@databricks.com> wrote:
>> 
>>> I looked up our usage logs (sorry I can't share this publicly) and trim
>>> has at least four orders of magnitude higher usage than char.
>>> 
>>> 
>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>> 
>>>> Thank you, Stephen and Reynold.
>>>> 
>>>> 
>>>> To Reynold.
>>>> 
>>>> 
>>>> I see the following a little differently.
>>>> 
>>>> 
>>>>       > CHAR is an undocumented data type without clearly defined
>>>> semantics.
>>>> 
>>>> Let me describe it from an Apache Spark user's point of view.
>>>> 
>>>> 
>>>> Apache Spark started to offer `HiveContext` (and the `hql/hiveql` functions)
>>>> in Apache Spark 1.x without much documentation. In addition, there still
>>>> exists an effort to keep it alive in the 3.0.0 era:
>>>> 
>>>>        https://issues.apache.org/jira/browse/SPARK-31088
>>>>        Add back HiveContext and createExternalTable
>>>> 
>>>> Historically, we tried to help many SQL-based customers migrate their
>>>> workloads from Apache Hive to Apache Spark through `HiveContext`.
>>>> 
>>>> Although Apache Spark didn't have good documentation about the inconsistent
>>>> behavior among its data sources, Apache Hive has been providing its
>>>> documentation, and many customers rely on that behavior.
>>>> 
>>>>       - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
>>>> 
>>>> At that time, in on-prem Hadoop clusters from well-known vendors, many
>>>> existing huge tables had been created by Apache Hive, not Apache Spark,
>>>> and Apache Spark was used to boost SQL performance with its *caching*.
>>>> This was because Apache Spark was added to the Hadoop vendors' products
>>>> later than Apache Hive.
>>>> 
>>>> 
>>>> Until the turning point at Apache Spark 2.0, we tried to catch up on
>>>> features to be consistent, at least for Hive tables, because the two SQL
>>>> engines (Apache Hive and Apache Spark) share the same tables.
>>>> 
>>>> As for the following, technically, while Apache Hive hasn't changed its
>>>> existing behavior in this area, Apache Spark has inevitably evolved by
>>>> moving away from its original old behaviors one by one.
>>>> 
>>>> 
>>>>       >  the value is already fucked up
>>>> 
>>>> 
>>>> The following is the change log (each switch has a legacy configuration
>>>> that restores the old behavior; see the sketch after this list):
>>>> 
>>>>       - When we switched the default value of
>>>> `spark.sql.hive.convertMetastoreParquet` (at Apache Spark 1.2)
>>>>       - When we switched the default value of
>>>> `spark.sql.hive.convertMetastoreOrc` (at Apache Spark 2.4)
>>>>       - When we switched `CREATE TABLE` itself (from a `TEXT` table to a
>>>> `PARQUET` table at Apache Spark 3.0)
>>>> 
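>>>> For illustration, a minimal spark-sql sketch of those fallbacks (the
>>>> configuration names are the ones listed in the summary later in this
>>>> thread; treat this as a sketch of the switches, not a recommendation):
>>>> 
>>>>     -- restore Hive behavior for Parquet tables (default changed at 1.2)
>>>>     spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
>>>>     -- restore Hive behavior for ORC tables (default changed at 2.4)
>>>>     spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
>>>>     -- restore Hive `CREATE TABLE` behavior (default changed at 3.0)
>>>>     spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
>>>> 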
>>>> To sum up, this has been a well-known issue in the community and among the
>>>> customers.
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <s...@infomedia.com.au> wrote:
>>>> 
>>>> 
>>>>> Hi there,
>>>>> 
>>>>> 
>>>>> I’m kind of new around here, but I have had experience with all of the
>>>>> so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>>>>> Server, as well as PostgreSQL.
>>>>> 
>>>>> 
>>>>> They all support the notion of “ANSI padding” for CHAR columns - which
>>>>> means that such columns are always space padded, and they default to
>>>>> having this enabled (for ANSI compliance).
>>>>> 
>>>>> 
>>>>> MySQL also supports it, but it defaults to leaving it disabled for
>>>>> historical reasons not unlike what we have here.
>>>>> 
>>>>> 
>>>>> In my opinion we should push toward standards compliance where possible
>>>>> and then document where it cannot work.
>>>>> 
>>>>> 
>>>>> If users don’t like the padding on CHAR columns then they should change to
>>>>> VARCHAR - I believe that was its purpose in the first place, and it does
>>>>> not dictate any sort of “padding”.
>>>>> 
>>>>> 
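>>>>> As an illustration (a minimal sketch, not tied to any particular engine;
>>>>> an ANSI-padding database behaves roughly like this):
>>>>> 
>>>>>     CREATE TABLE t(c CHAR(3), v VARCHAR(3));
>>>>>     INSERT INTO t VALUES ('a', 'a');
>>>>>     SELECT length(c), length(v) FROM t;
>>>>>     -- with ANSI padding this typically returns 3 and 1,
>>>>>     -- because 'a' is stored as 'a  ' in the CHAR(3) column
>>>>> 
>>>>> 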
>>>>> I can see why you might “ban” the use of CHAR columns where they cannot be
>>>>> consistently supported, but VARCHAR is a different animal and I would
>>>>> expect it to work consistently everywhere.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> 
>>>>> Steve C
>>>>> 
>>>>> 
>>>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi, Reynold.
>>>>>> (And +Michael Armbrust)
>>>>>> 
>>>>>> 
>>>>>> If you think so, do you think it's okay that we changed the return value
>>>>>> silently? Then I'm wondering why we reverted the `TRIM` functions.
>>>>>> 
>>>>>> 
>>>>>> > Are we sure "not padding" is "incorrect"?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> 
>>>>>>> 100% agree with Reynold.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gourav Sengupta
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <r...@databricks.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> Are we sure "not padding" is "incorrect"?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>>>>>>> databases don't actually pad.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>>>>>>> "Snowflake currently deviates from common CHAR semantics in that
>>>>>>>> strings shorter than the maximum length are not space-padded at the end."
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi, Reynold.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Please see the following for the context.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>> syntax"
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I raised the above issue according to the new rubric, and the banning 
>>>>>>>>> was
>>>>>>>>> the proposed alternative to reduce the potential issue.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Please give us your opinion since it's still a PR.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>> 
>>>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell 
>>>>>>>>>> out
>>>>>>>>>> of both new and old users?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> For old users, their old code that was working for char(3) would now 
>>>>>>>>>> stop
>>>>>>>>>> working. 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> For new users, it depends on whether the underlying metastore char(3) is
>>>>>>>>>> supported but different from ANSI SQL (which is not that big of a deal if
>>>>>>>>>> we explain it) or not supported. 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Hi, All.
>>>>>>>>>>> 
>>>>>>>>>>> Apache Spark has suffered from a known consistency issue with `CHAR`
>>>>>>>>>>> type behavior among its usages and configurations. However, the
>>>>>>>>>>> evolution has been gradually moving toward consistency inside Apache
>>>>>>>>>>> Spark because we don't have `CHAR` officially. The following is the
>>>>>>>>>>> summary.
>>>>>>>>>>> 
>>>>>>>>>>> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives the following different
>>>>>>>>>>> result. (`spark.sql.hive.convertMetastoreParquet=false` provides a
>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>> 
>>>>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>>>>>>> 
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>>>>>> 
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>     a   3
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>     a   3
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>     a 2
>>>>>>>>>>> 
>>>>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to 
>>>>>>>>>>> Hive
>>>>>>>>>>> behavior.)
>>>>>>>>>>> 
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>     a   3
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>     a 2
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>     a 2
>>>>>>>>>>> 
>>>>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) 
>>>>>>>>>>> became
>>>>>>>>>>> consistent.
>>>>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>> 
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>     a 2
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>     a 2
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>     a 2
>>>>>>>>>>> 
>>>>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type 
>>>>>>>>>>> in the
>>>>>>>>>>> following syntax to be safe.
>>>>>>>>>>> 
>>>>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>>>>    https://github.com/apache/spark/pull/27902
>>>>>>>>>>> 
>>>>>>>>>>> This email is sent out to inform you based on the new policy we voted
>>>>>>>>>>> on. The recommendation is to always use Apache Spark's native type
>>>>>>>>>>> `String`.
>>>>>>>>>>> 
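>>>>>>>>>>> For example, a minimal sketch with the native type (`t4` is just an
>>>>>>>>>>> illustrative table, and `rpad` is used only to show how fixed-width
>>>>>>>>>>> padding can be made explicit when it is really wanted):
>>>>>>>>>>> 
>>>>>>>>>>>     spark-sql> CREATE TABLE t4(a STRING) STORED AS PARQUET;
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t4 SELECT rpad('a', 3, ' ');
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t4;
>>>>>>>>>>>     a   3
>>>>>>>>>>> 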
>>>>>>>>>>> Bests,
>>>>>>>>>>> Dongjoon.
>>>>>>>>>>> 
>>>>>>>>>>> References:
>>>>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>>>>      https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE 
>>>>>>>>>>> TABLE
>>>>>>>>>>> syntax", 2019/12/06
>>>>>>>>>>>    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>> 