Dear Community,

This mail is regarding the char/varchar implementation in CarbonData. Spark 3.1 recently added a char/varchar implementation [#1].

*command reference:*
1. create table charVarchar (id int, country varchar(10), name char(5), addr string) stored as carbondata;
2. insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';

VarcharType(length): A variant of StringType which has a length limitation. Data writing will fail if the input string exceeds the length limitation. Note: this type can only be used in a table schema, not in functions/operators.

CharType(length): A variant of VarcharType(length) which is fixed length. Reading a column of type CharType(n) always returns string values of length n. Char type column comparison pads the shorter value to the longer length.

*Current behaviour [CarbonData]:*
CarbonData's existing varchar implementation is different from Spark's: CarbonData treats a string column as a varchar column only if the column is listed in the long_string_columns table property. If we execute the above commands with CarbonData, it converts the char/varchar column data types to string columns and loads the data without any length checks (char(5) will accept more than 5 characters, varchar(10) will accept more than 10 characters).
- A normal string column accepts values up to the max size of short (32k).
- A string column configured in long_string_columns accepts values up to the max size of int (more than 32k).

*Spark & Parquet behaviour:*
1) If we run the above commands with Parquet, Spark stores the columns as char/varchar data types and validates string lengths against the lengths given in the create table command. On a length mismatch, the load/insert command fails with a parse exception.
2) If we declare char(n) and load values shorter than n, Spark pads with trailing spaces in the following cases:
   i) When reading char type columns. Spark does not pad on the write side, to save storage space.
   ii) When comparing a char type column with a string literal or another char type column.
For more details about the Spark implementation, refer to PR [#1]. A small sketch contrasting the two behaviours follows.
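To make the difference concrete, here is a sketch of both behaviours using the commands above (assuming default configurations; the table names charVarcharParquet and longStr are just for illustration, and the failure comments paraphrase the behaviour described in [#1] rather than exact error messages):

-- Current CarbonData behaviour: char/varchar are silently converted to
-- string, so values longer than the declared length are accepted.
create table charVarchar (id int, country varchar(10), name char(5), addr string)
stored as carbondata;
insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';
-- succeeds today, although 'mahesh' has 6 characters
describe charVarchar;  -- country and name are shown as string

-- To store values longer than 32k in CarbonData today, the column must be
-- listed in the long_string_columns table property:
create table longStr (id int, note string)
stored as carbondata
tblproperties('long_string_columns'='note');

-- Spark + Parquet behaviour: lengths are enforced on write, and char
-- columns are padded on read and in comparisons.
create table charVarcharParquet (id int, country varchar(10), name char(5), addr string)
stored as parquet;
insert into charVarcharParquet select 1, 'india', 'mahesh', 'bangalore';
-- fails: 'mahesh' exceeds char(5)
insert into charVarcharParquet select 1, 'india', 'mahe', 'bangalore';
-- succeeds
select length(name) from charVarcharParquet;   -- 5: 'mahe' is padded on read
select name = 'mahe' from charVarcharParquet;  -- true: the literal is padded for the comparison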
*Proposed Solution:*
1) Keep the existing CarbonData varchar implementation [string columns with long_string_columns], since removing it could cause compatibility issues.
2) Support the new column data types char(n) and varchar(n), and show them in metadata as the actual CharType(n) and VarcharType(n) instead of string columns.
3) Handle the length check for char/varchar in both the partitioned and non-partitioned cases. On a length mismatch, throw a parse exception.
4) Phase 1: implement for primitive columns. Phase 2: extend to complex columns.
A sketch of the expected behaviour is in the P.S. below.

*Benefits:*
char and varchar are standard SQL types, and varchar is widely used in other databases instead of the string type.

#1: https://github.com/apache/spark/pull/30412

*Please provide your valuable inputs and suggestions. Thank you in advance!*

Thanks & Regards,
-Mahesh Raju Somalaraju
github id: maheshrajus
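P.S. A minimal sketch of the proposed behaviour, assuming phase 1 (primitive columns) is implemented; the failure comments illustrate the parse exception proposed in point 3, not an exact message:

create table charVarchar (id int, country varchar(10), name char(5), addr string)
stored as carbondata;
describe charVarchar;  -- proposed: country shown as varchar(10), name as char(5)
insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';
-- proposed: fails with a parse exception, since 'mahesh' exceeds char(5)
insert into charVarchar select 1, 'india', 'mahe', 'bangalore';
-- proposed: succeeds; the same length checks apply to partitioned tables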