+1 on this feature. Let us keep the exact same behaviour as Spark.

Regards,
Vikram
On Mon, Oct 11, 2021 at 1:17 AM Mahesh Raju Somalaraju <maheshraju.o...@gmail.com> wrote:
> Dear Community,
>
> This mail is regarding the char/varchar implementation in CarbonData. Spark 3.1 recently added a char/varchar implementation [#1].
>
> command reference:
> 1. create table charVarchar (id int, country varchar(10), name char(5), addr string) stored as carbondata;
> 2. insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';
>
> VarcharType(length): A variant of `StringType` which has a length limitation. Data writing will fail if the input string exceeds the length limitation. Note: this type can only be used in a table schema, not in functions/operators.
>
> CharType(length): A variant of `VarcharType(length)` which is fixed length. Reading a column of type `CharType(n)` always returns string values of length `n`. Char type column comparison will pad the shorter one to the longer length.
>
> Current behaviour [CarbonData]:
> CarbonData's existing varchar implementation is different from Spark's. CarbonData treats a string column as a varchar column if it is configured under long_string_columns, which can be set in the table properties.
>
> If we execute the above commands with CarbonData, it converts the char/varchar column data types to string columns and loads the data without any length checks (char(5) will allow more than 5 characters; varchar(10) will allow more than 10 characters).
> - A normal string column accepts values up to the max size of short (32k).
> - A column configured with long_string_columns accepts values up to the max size of integer (more than 32k).
>
> Spark & parquet behaviour:
> 1) If we run the above commands with parquet, the data is stored as char/varchar types and string lengths are validated against the lengths given in the create table command. On a length mismatch, the load/insert command fails with a parse exception.
> 2) If we declare char(n) and load a value shorter than n, Spark pads with trailing spaces in the following cases:
> i) when reading char type columns (Spark does not pad on the write side, to save storage space);
> ii) when comparing a char type column with a string literal or another char type column.
>
> For more details about the Spark implementation, we can refer to PR [#1].
>
> Proposed solution:
> 1) Keep the existing CarbonData varchar implementation [string columns with long_string_columns], as removing it may cause compatibility issues.
> 2) Support the new column data types char(n) and varchar(n). Show them in metadata as the actual CharType(n) and VarcharType(n) instead of string columns.
> 3) Handle the length check for char/varchar in both the partitioned and non-partitioned cases. On a length mismatch, throw a parse exception.
> 4) Phase 1: develop for primitive columns; phase 2: check for complex columns.
>
> Benefits:
> char and varchar are standard SQL types. varchar is widely used in other databases instead of the string type.
>
> #1 https://github.com/apache/spark/pull/30412
>
> Please provide your valuable inputs and suggestions. Thank you in advance!
>
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
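For anyone skimming the thread, the existing long_string_columns behaviour referenced above looks roughly like this (a minimal sketch; the table and column names are made up):

  create table longStrTable (id int, notes string)
  stored as carbondata
  tblproperties ('long_string_columns'='notes');
  -- 'notes' is treated as a varchar column internally, so its values may
  -- exceed the 32k limit that applies to normal string columns

This is the implementation that point 1) of the proposal keeps for compatibility.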
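To make the proposed length check in points 2) and 3) concrete, the expected outcomes would be roughly as follows (a sketch only; the exact exception raised on overflow is an assumption based on the Spark/parquet behaviour described above):

  create table charVarchar (id int, country varchar(10), name char(5), addr string)
  stored as carbondata;

  -- within the declared limits: succeeds
  insert into charVarchar select 1, 'india', 'abc', 'bangalore';

  -- 'netherlands' is 11 characters, exceeding varchar(10): expected to
  -- fail the length check, matching the Spark/parquet behaviour
  insert into charVarchar select 2, 'netherlands', 'abc', 'amsterdam';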
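And the char(n) read/comparison padding from the Spark PR, which CarbonData would inherit by keeping the exact same behaviour, in a small example (assuming the table and the first insert from the sketch above):

  select name, length(name) from charVarchar where id = 1;
  -- returns 'abc  ' with length 5: padding is applied on the read side,
  -- not on the write side

  select * from charVarchar where name = 'abc';
  -- matches: the shorter operand is padded with trailing spaces to
  -- length 5 before the comparison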