[jira] [Updated] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

Yuzhou Sun (Jira) Sat, 16 Oct 2021 20:38:06 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yuzhou Sun updated SPARK-37027:
-------------------------------
    Attachment: SPARK-37027-test-example.patch

> Fix behavior inconsistent in Hive table when ‘path’ is provided in 
> SERDEPROPERTIES
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-37027
>                 URL: https://issues.apache.org/jira/browse/SPARK-37027
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 3.1.2
>            Reporter: Yuzhou Sun
>            Priority: Trivial
>         Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES 
> ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can 
> return doubled rows when reading the table. This issue seems to be an 
> extension of SPARK-30507.
>  Steps to reproduce:
>  # Create the table and insert records via Hive (Spark does not allow 
> inserting into a table like this)
> {code:sql}
> CREATE TABLE `test_table`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocationPath>';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
>  # Read the above table from Spark
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> However, with {{spark.sql.hive.convertMetastoreParquet=false}}, Spark 
> returns the same result as Hive (i.e. a single row).
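> For reference, a minimal sketch of that check in a Spark SQL session. It 
> assumes the {{test_table}} created in step 1; {{SET}} changes the option 
> for the current session only.
> {code:sql}
> -- Read Hive metastore Parquet tables through the Hive SerDe
> -- instead of Spark's built-in Parquet support.
> SET spark.sql.hive.convertMetastoreParquet=false;
> SELECT * FROM `test_table`;
> -- per the description above, this should return a single row:
> -- 0 0
> {code}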
> A similar case: if a Hive table is created with both {{WITH 
> SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}}, 
> Spark reads the rows under both {{anotherPath}} and {{tableLocation}}, 
> regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}. 
> Hive, however, seems to return only the rows under {{tableLocation}}.
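> A sketch of this second case, adapted from the DDL in step 1. The table 
> name {{test_table_2}} is hypothetical, {{BIGINT}} is used for the integer 
> column as an assumption, and {{<anotherPath>}} and {{<tableLocation>}} are 
> placeholders for two different directories that both contain Parquet files.
> {code:sql}
> CREATE TABLE `test_table_2`(
>   `c1` BIGINT,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> -- 'path' points somewhere other than LOCATION
> WITH SERDEPROPERTIES ('path'='<anotherPath>')
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocation>';
> -- As reported above: Spark returns rows from both directories,
> -- while Hive seems to return only the rows under <tableLocation>.
> SELECT * FROM `test_table_2`;
> {code}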
> Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark 
> does not double the rows when {{'path'='<tableLocation>'}}. If 
> {{'path'='<anotherPath>'}}, Spark reads the rows under both {{anotherPath}} 
> and {{tableLocation}}, whereas Hive seems to ignore the {{path}} in 
> {{TBLPROPERTIES}} either way.
> Code examples for the above cases (a diff patch written against 
> {{HiveParquetMetastoreSuite.scala}}) can be found in the attachments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
