xsys created HIVE-26533:
---------------------------

             Summary: Column data type is lost when an Avro table with a BYTE 
column is written through spark-sql
                 Key: HIVE-26533
                 URL: https://issues.apache.org/jira/browse/HIVE-26533
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 3.1.2
            Reporter: xsys


h3. Describe the bug

We are trying to store a table in the {{Avro}} file format through the 
{{spark-sql}} interface. The table's schema contains a column of the {{BYTE}} 
data type, and the column's name contains an uppercase letter.

When we {{INSERT}} a valid value (e.g. {{{}-128{}}}), we see the following 
warning:
{code:java}
WARN HiveExternalCatalog: The table schema given by Hive 
metastore(struct<c0:int,c1:int>) is different from the schema when this table 
was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to 
the table schema from Hive metastore which is not case preserving.{code}
 
Finally, when we perform a {{DESC}} on the table, we observe that the {{BYTE}} 
data type has been converted to {{{}int{}}} and that the column name has lost 
its original case (it is converted to lowercase).
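For context, Avro's primitive types (null, boolean, int, long, float, double, 
bytes, string) include no 8-bit integer, so the Avro schema that the AvroSerDe 
derives for this table can only describe {{C1}} as an {{int}}. Below is a 
hand-written sketch of what that derived schema presumably looks like; the 
record name, field casing and null-union wrapping are assumptions rather than 
captured output:
{code:java}
{
  "type": "record",
  "name": "hive_tinyint_avro",
  "fields": [
    {"name": "c0", "type": ["null", "int"], "default": null},
    {"name": "c1", "type": ["null", "int"], "default": null}
  ]
}{code}
Any consumer that later re-derives the table schema from this Avro schema sees 
{{struct<c0:int,c1:int>}}, which matches the warning above.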
h3. Steps to reproduce

On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-sql}} with the Avro 
package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
 
Execute the following:
{code:java}
spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE 
"org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT 
"org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT 
"org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
Time taken: 0.359 seconds
spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive 
metastore(struct<c0:int,c1:int>) is different from the schema when this table 
was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to 
the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive 
metastore(struct<c0:int,c1:int>) is different from the schema when this table 
was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to 
the table schema from Hive metastore which is not case preserving.
Time taken: 1.605 seconds
spark-sql> desc hive_tinyint_avro;
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive 
metastore(struct<c0:int,c1:int>) is different from the schema when this table 
was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to 
the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive 
metastore(struct<c0:int,c1:int>) is different from the schema when this table 
was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to 
the table schema from Hive metastore which is not case preserving.
c0                      int
c1                      int // Data type and case-sensitivity lost
Time taken: 0.068 seconds, Fetched 2 row(s){code}
 
h3. Expected behavior

We expect the column name's case and the data type to be preserved. We tried 
other formats such as Parquet and ORC, and the outcome is consistent with this 
expectation.

Here are the logs from our attempt at doing the same with Parquet:
{noformat}
spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
Time taken: 0.134 seconds
spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
Time taken: 0.995 seconds
spark-sql> desc hive_tinyint_parquet;
c0                      int
C1                      tinyint  // Data type and case-sensitivity preserved
Time taken: 0.092 seconds, Fetched 2 row(s){noformat}
h3. Root Cause
 
[TypeInfoToSchema|https://github.com/apache/hive/blob/8190d2be7b7165effa62bd21b7d60ef81fb0e4af/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L41]'s [createAvroPrimitive|https://github.com/apache/hive/blob/rel/release-3.1.2/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L124-L132] is where Hive's BYTE, SHORT and INT are all converted into Avro's INT:
{code:java}
      case BYTE:
        schema = Schema.create(Schema.Type.INT);
        break;
      case SHORT:
        schema = Schema.create(Schema.Type.INT);
        break;
      case INT:
        schema = Schema.create(Schema.Type.INT);
        break;
{code}
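The narrowing can be reproduced without Spark by driving this conversion 
directly. A minimal sketch, assuming Hive 3.1.2 (and its Avro dependency) on 
the classpath and the {{convert}} signature from the sources linked above; 
{{TinyintNarrowingDemo}} is a hypothetical harness, not part of Hive:
{code:java}
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;

public class TinyintNarrowingDemo {
  public static void main(String[] args) {
    // Convert a Hive schema equivalent to (c0 INT, C1 BYTE) into an
    // Avro schema, the same way AvroSerDe does internally.
    Schema avroSchema = new TypeInfoToSchema().convert(
        Arrays.asList("c0", "C1"),
        Arrays.<TypeInfo>asList(
            TypeInfoFactory.intTypeInfo, TypeInfoFactory.byteTypeInfo),
        Arrays.asList("", ""),
        null,                  // namespace
        "hive_tinyint_avro",   // record name
        null);                 // doc
    // Expectation: both fields print as Avro "int"; the TINYINT is
    // already indistinguishable from INT at this point.
    System.out.println(avroSchema.toString(true));
  }
}{code}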
 
Once the Hive schema has been converted into an Avro schema, the actual Hive 
types specified by the user are no longer tracked. Therefore, once TINYINT/BYTE 
has been mapped to INT, the distinction is lost inside the AvroSerDe instance.
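Notably, the same method already preserves Hive types that have no exact Avro 
counterpart: CHAR, VARCHAR and DECIMAL are emitted with a {{logicalType}} 
annotation that the read path maps back to the original Hive type. One possible 
direction, shown below as an untested sketch rather than a proposed patch, 
would be to annotate the generated {{int}} schema the same way (the 
{{tinyint}}/{{smallint}} logicalType names are hypothetical, and 
{{SchemaToTypeInfo}} would need matching changes on the read path):
{code:java}
      // Untested sketch of createAvroPrimitive: tag the Avro INT with the
      // original Hive type, mirroring the existing CHAR/VARCHAR/DECIMAL
      // handling in the same switch. Hypothetical logicalType names.
      case BYTE:
        schema = AvroSerdeUtils.getSchemaFor(
            "{\"type\":\"int\",\"logicalType\":\"tinyint\"}");
        break;
      case SHORT:
        schema = AvroSerdeUtils.getSchemaFor(
            "{\"type\":\"int\",\"logicalType\":\"smallint\"}");
        break;
      case INT:
        schema = Schema.create(Schema.Type.INT);
        break;
{code}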
 
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
