[
https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuzhou Sun updated SPARK-37027:
-------------------------------
Attachment: SPARK-37027-test-example.patch
> Fix behavior inconsistent in Hive table when ‘path’ is provided in
> SERDEPROPERTIES
> ----------------------------------------------------------------------------------
>
> Key: SPARK-37027
> URL: https://issues.apache.org/jira/browse/SPARK-37027
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5, 3.1.2
> Reporter: Yuzhou Sun
> Priority: Trivial
> Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES
> ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can
> return doubled rows when reading the table. This issue seems to be an
> extension of SPARK-30507.
> Steps to reproduce:
> # Create the table and insert records via Hive (Spark does not allow inserting
> into a table like this)
> {code:sql}
> CREATE TABLE `test_table`(
> `c1` LONG,
> `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocationPath>';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
> # Read the above table from Spark
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark
> returns the same result as Hive (i.e. a single row).
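> A minimal sketch of that check in a Spark SQL session, reusing
> {{test_table}} from the steps above:
> {code:sql}
> -- Turn off Spark's conversion of Hive metastore Parquet tables to its
> -- built-in Parquet reader, so the table is read through the Hive SerDe.
> SET spark.sql.hive.convertMetastoreParquet=false;
> SELECT * FROM `test_table`;
> -- reported to return a single row, matching Hive
> {code}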
> A similar case: if a Hive table is created with both {{WITH
> SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}},
> Spark reads the rows under both {{anotherPath}} and {{tableLocation}},
> regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}.
> Hive, however, seems to return only the rows under {{tableLocation}}.
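> The case above could be set up in Hive roughly as follows. This is only a
> sketch modeled on the DDL from the reproduction steps; the table name
> {{test_table_other_path}} is hypothetical, and {{<anotherPath>}} and
> {{<tableLocation>}} are placeholders for two different directories.
> {code:sql}
> -- Hypothetical table: 'path' in SERDEPROPERTIES differs from LOCATION.
> -- BIGINT is Hive's 64-bit integer type.
> CREATE TABLE `test_table_other_path`(
> `c1` BIGINT,
> `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'='<anotherPath>')
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocation>';
> {code}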
> Another similar case: if {{path}} is provided in {{TBLPROPERTIES}},
> Spark does not double the rows when {{'path'='<tableLocation>'}}. If
> {{'path'='<anotherPath>'}}, Spark reads the rows under both {{anotherPath}}
> and {{tableLocation}}, whereas Hive seems to ignore the {{path}} in
> {{TBLPROPERTIES}}.
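> A sketch of the {{TBLPROPERTIES}} variant, again created via Hive. The table
> name {{test_table_tblprops}} is hypothetical, and {{<anotherPath>}} and
> {{<tableLocation>}} are placeholders.
> {code:sql}
> -- Hypothetical table: 'path' is set in TBLPROPERTIES instead of
> -- SERDEPROPERTIES, and points somewhere other than LOCATION.
> CREATE TABLE `test_table_tblprops`(
> `c1` BIGINT,
> `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocation>'
> TBLPROPERTIES ('path'='<anotherPath>');
> {code}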
> Code examples for the above cases (a diff patch against
> {{HiveParquetMetastoreSuite.scala}}) can be found in the attachments.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)