+1 on this feature. Let us keep the exact same behaviour as Spark.

Regards,
Vikram
On Mon, Oct 11, 2021 at 1:17 AM Mahesh Raju Somalaraju <maheshraju.o...@gmail.com> wrote:
> Dear Community,
>
> This mail is regarding the char/varchar implementation in CarbonData. Spark 3.1 recently added a char/varchar implementation [#1].
>
> command reference:
> 1. create table charVarchar (id int, country varchar(10), name char(5), addr string) stored as carbondata;
> 2. insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';
>
> VarcharType(length): A variant of `StringType` which has a length limitation. Data writing will fail if the input string exceeds the length limitation. Note: this type can only be used in a table schema, not in functions/operators.
>
> CharType(length): A variant of `VarcharType(length)` which is fixed length. Reading a column of type `CharType(n)` always returns string values of length `n`. Char type column comparison will pad the shorter one to the longer length.
>
> Current behaviour [CarbonData]:
> CarbonData's existing varchar implementation is different from Spark's. CarbonData treats a string column as a varchar column if it is configured under long_string_columns, which can be set in the table properties.
>
> If we execute the above commands with CarbonData, it converts the char/varchar column data types to string columns and loads the data without any length checks (char(5) will allow more than 5 characters; varchar(10) will allow more than 10 characters).
> - A normal string column accepts values up to the max size of short (32k).
> - A column configured with long_string_columns accepts values up to the max size of integer (more than 32k).
>
> Spark & parquet behaviour:
> 1) If we run the above commands with parquet, the data is stored as char/varchar types and string lengths are validated against the lengths given in the create table command. On a length mismatch, the load/insert command fails with a parse exception.
> 2) If we declare char(n) and load a value shorter than n, Spark pads with trailing spaces in the following cases:
> i) when reading char type columns (Spark does not pad on the write side, to save storage space);
> ii) when comparing a char type column with a string literal or another char type column.
>
> For more details about the Spark implementation, we can refer to PR [#1].
>
> Proposed solution:
> 1) Keep the existing CarbonData varchar implementation [string columns with long_string_columns], as removing it may cause compatibility issues.
> 2) Support the new column data types char(n) and varchar(n). Show them in metadata as the actual CharType(n) and VarcharType(n) instead of string columns.
> 3) Handle the length check for char/varchar in both the partitioned and non-partitioned cases. On a length mismatch, throw a parse exception.
> 4) Phase 1: develop for primitive columns; phase 2: check for complex columns.
>
> Benefits:
> char and varchar are standard SQL types. varchar is widely used in other databases instead of the string type.
>
> #1 https://github.com/apache/spark/pull/30412
>
> Please provide your valuable inputs and suggestions. Thank you in advance!
>
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
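For anyone skimming the thread, the existing long_string_columns behaviour referenced above looks roughly like this (a minimal sketch; the table and column names are made up):

  create table longStrTable (id int, notes string)
  stored as carbondata
  tblproperties ('long_string_columns'='notes');
  -- 'notes' is treated as a varchar column internally, so its values may
  -- exceed the 32k limit that applies to normal string columns

This is the implementation that point 1) of the proposal keeps for compatibility.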
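To make the proposed length check in points 2) and 3) concrete, the expected outcomes would be roughly as follows (a sketch only; the exact exception raised on overflow is an assumption based on the Spark/parquet behaviour described above):

  create table charVarchar (id int, country varchar(10), name char(5), addr string)
  stored as carbondata;

  -- within the declared limits: succeeds
  insert into charVarchar select 1, 'india', 'abc', 'bangalore';

  -- 'netherlands' is 11 characters, exceeding varchar(10): expected to
  -- fail the length check, matching the Spark/parquet behaviour
  insert into charVarchar select 2, 'netherlands', 'abc', 'amsterdam';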
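And the char(n) read/comparison padding from the Spark PR, which CarbonData would inherit by keeping the exact same behaviour, in a small example (assuming the table and the first insert from the sketch above):

  select name, length(name) from charVarchar where id = 1;
  -- returns 'abc  ' with length 5: padding is applied on the read side,
  -- not on the write side

  select * from charVarchar where name = 'abc';
  -- matches: the shorter operand is padded with trailing spaces to
  -- length 5 before the comparison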