My guess is that this is because Parquet doesn't have a CHAR type, so the
padding should be applied to string values by Spark when writing Parquet.

The reason from Parquet's perspective not to support CHAR is that we have
no expectation that it is a portable type. Non-SQL writers aren't going to
pad values with spaces, and it is a terrible idea for Parquet to silently
alter or truncate data to fit the CHAR type. There's also no performance
gain from CHAR because multi-byte UTF-8 characters prevent us from using a
fixed-length binary field. The conclusion we came to is that CHAR is a
SQL-only type and has to be enforced by SQL engines: Spark should pad or
truncate values, and expect Parquet to faithfully hand back exactly what
was stored.
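
To make that rule concrete, here is roughly what write-side enforcement
would look like, sketched with built-in Spark SQL functions. This is an
illustration of the rule, not a claim about how Spark would implement it;
rpad is a real Spark SQL function, though it silently shortens overlong
values where a strict SQL engine would raise an error for non-space
overflow.

spark-sql> -- pad to the declared CHAR(3) width before the value reaches
spark-sql> -- the file format; rpad counts characters, not bytes, so
spark-sql> -- multi-byte UTF-8 is handled
spark-sql> SELECT rpad('a ', 3, ' ') AS a, length(rpad('a ', 3, ' ')) AS len;
spark-sql> -- returns 'a  ' (padded to 3 characters), length 3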

If Spark doesn't have logic for this, then it is probably relying on the
Hive SerDe to pad values in the first case (t1). I'm not sure what ORC
does; maybe it has a native CHAR type.
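
Until Spark enforces this itself, a workaround for your example (the
quoted session below) is to do the padding explicitly when inserting into
the Parquet table:

spark-sql> INSERT INTO TABLE t3 SELECT rpad('a ', 3, ' ');
spark-sql> SELECT a, length(a) FROM t3;
spark-sql> -- the padded row should come back with length 3, like t1 and t2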

rb

On Thu, Sep 14, 2017 at 5:31 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, All.
>
> Currently, Spark shows different behavior when we use the CHAR type.
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> The reason I'm asking here is that this has long been the default
> behavior of Spark's `STORED AS PARQUET`. (Spark 1.6.3 behaves the same
> way.)
>
> For me, `CREATE TABLE t1(a CHAR(3))` shows the correct behavior in Spark,
> but Parquet has also been the de facto standard format in Spark. (I'm not
> comparing this with other DBMSs.)
>
> I'm wondering which way we need to go, or want to go, in Spark.
>
> Bests,
> Dongjoon.
>



-- 
Ryan Blue
Software Engineer
Netflix
