Dear Community,

This mail is regarding the char/varchar implementation in CarbonData. Spark
3.1 recently added a char/varchar implementation[*#1*].

*command reference:*
1. create table charVarchar (id int, country varchar(10), name char(5),
addr string) stored as carbondata;
2. insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';

     VarcharType(length): a variant of `StringType` which has a length
limitation. Data writing will fail if the input string exceeds the length
limitation. Note: this type can only be used in a table schema, not in
functions/operators.

      CharType(length): a variant of `VarcharType(length)` which is
fixed-length. Reading a column of type `CharType(n)` always returns string
values of length `n`. Char type column comparison will pad the shorter
value to the longer length.
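The two type definitions above can be modeled with a small sketch (plain
Python, not CarbonData/Spark code; function names here are illustrative):

```python
def validate_varchar(value: str, length: int) -> str:
    """VarcharType(length): writing fails if the input exceeds the limit."""
    if len(value) > length:
        raise ValueError(
            f"input string of length {len(value)} exceeds varchar({length})")
    return value


def read_char(value: str, length: int) -> str:
    """CharType(length): reads always return strings of exactly `length`,
    right-padding shorter values with spaces."""
    validate_varchar(value, length)
    return value.ljust(length)
```

For example, `read_char('abc', 5)` returns `'abc  '`, while
`validate_varchar('bangalore!!', 10)` raises an error.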

*Current behaviour[CarbonData]:*
CarbonData's existing varchar implementation differs from Spark's.
CarbonData treats a string column as a varchar column if the column is
configured under long_string_columns, which can be set in the table
properties.

If we execute the above commands with CarbonData, it converts the
char/varchar column data types to string columns and loads the data without
any length checks (char(5) will allow more than 5 characters; varchar(10)
will allow more than 10 characters).
   - A normal string column accepts values up to the max size of a short
(32k).
   - A column configured with long_string_columns accepts values up to the
max size of an int (more than 32k).

*Spark & Parquet behaviour:*
1) If we run the above commands with Parquet, the data is stored as
char/varchar types and the string lengths given in the create table command
are validated. If a length check fails, the load/insert command fails with
a parse exception.
2) If we declare char(n) with a large n and load shorter values, Spark pads
with trailing spaces in the following cases:
  i) String padding is done when reading char type columns; Spark does not
pad on the write side, to save storage space.
  ii) String padding is done when comparing a char type column with a
string literal or with another char type column.
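The comparison-padding rule in case ii) can be sketched as follows (plain
Python, for illustration only): before comparing two char-typed values, or
a char value with a string literal, the shorter side is right-padded with
spaces to the longer length.

```python
def char_equals(a: str, b: str) -> bool:
    """Compare two char-typed values, padding the shorter to the longer
    length, so trailing spaces do not affect equality."""
    width = max(len(a), len(b))
    return a.ljust(width) == b.ljust(width)
```

So `char_equals('ab', 'ab   ')` is true, while `char_equals('ab', 'abc')`
is false.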

For more details on the Spark implementation, refer to PR [*#1*].

*Proposed Solution:*
1) Keep the existing CarbonData varchar implementation [string columns with
long_string_columns]; removing it could cause compatibility issues.
2) Support the new column data types char(n) and varchar(n). Show them in
metadata as the actual CharType(n) and VarcharType(n) instead of string
columns.
3) Enforce the length check for char/varchar in both the partitioned and
non-partitioned cases. If a length check fails, throw a parse exception.
4) Phase 1: develop for primitive columns; phase 2: extend to complex
columns.
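A minimal sketch of the length check proposed in point 3), assuming a
per-row validation during load/insert (all names here - ColumnSpec,
check_row, LengthCheckError - are hypothetical, not CarbonData APIs):

```python
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    """Hypothetical column descriptor for the sketch."""
    name: str
    type_name: str   # "char", "varchar", or "string"
    length: int = 0  # declared length for char/varchar


class LengthCheckError(Exception):
    """Stands in for the parse exception the proposal would raise."""


def check_row(row, schema):
    """Raise if any char/varchar value exceeds its declared length."""
    for value, col in zip(row, schema):
        if col.type_name in ("char", "varchar") and len(value) > col.length:
            raise LengthCheckError(
                f"value '{value}' exceeds {col.type_name}({col.length}) "
                f"for column {col.name}")
```

With the example table above (country varchar(10), name char(5)), a row
containing the 6-character value 'mahesh' for the char(5) column would be
rejected.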

*Benefits*:
char and varchar are standard SQL types; varchar is widely used in other
databases in place of the string type.

*#1 *https://github.com/apache/spark/pull/30412

*Please provide your valuable inputs and suggestions. Thank you in advance
!*

Thanks & Regards
-Mahesh Raju Somalaraju
github id: maheshrajus
