[ 
https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anoop Johnson updated SPARK-31799:
----------------------------------
    Description: 
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive 
metadata, breaking compatibility with other query engines like Hive and Presto. 
Here is a very simple example:

{code:sql}
CREATE TABLE test_csv (id String, name String)
USING csv
LOCATION 's3://[...]';
{code}

If you describe the table using Presto, you will see:

{code:sql}
CREATE TABLE `test_csv` (`id` STRING, `name` STRING)
USING csv
OPTIONS (
  `serialization.format` '1',
  path 's3://anoopj-emr-dev-us-east-1/csvtest_ctas'
)

CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='s3://[...]') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.SequenceFileInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://[...]/test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4', 
  'spark.sql.sources.provider'='csv', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'transient_lastDdlTime'='1590196086');
{code}

The table location is set to a placeholder value, and the schema is always set 
to {{`col` array<string>}}. The serde/input format is also wrong: it says 
sequence file even though the requested format is CSV.

The correct metadata is written to the custom table properties with the 
`spark.sql` prefix, but Hive and Presto do not understand these properties, 
which breaks those engines.
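
For comparison, here is a sketch of the Hive metadata one would expect for this 
table (assuming Hive's OpenCSVSerde and plain text input/output formats; the 
exact serde choice is my assumption, not what Spark emits today):

{code:sql}
-- Hypothetical expected metastore DDL for the CSV table above.
CREATE EXTERNAL TABLE `test_csv`(
  `id` string,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://[...]';
{code}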

I could reproduce this with JSON too, but not with Parquet. I root-caused the 
issue to CSV and JSON tables not being handled 
[here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66]
in HiveSerDe.scala.
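
A rough sketch of the kind of entries that could be added to the serde mapping 
in HiveSerDe.scala (the serde and format class names below are my assumption of 
reasonable defaults, not existing Spark code):

{code:scala}
// Hypothetical additions to the serdeMap in
// org.apache.spark.sql.internal.HiveSerDe, so that "USING csv" and
// "USING json" record a real Hive serde and input/output format instead of
// the sequence-file placeholder.
import org.apache.spark.sql.internal.HiveSerDe

val extraSerdes: Map[String, HiveSerDe] = Map(
  "csv" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    serde = Option("org.apache.hadoop.hive.serde2.OpenCSVSerde")),
  "json" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    serde = Option("org.apache.hive.hcatalog.data.JsonSerDe")))
{code}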

Is there a reason why they are not handled? I could send a patch to fix this, 
but the caveat is that the CSV and JSON Hive serdes would need to be on the 
Spark classpath; otherwise, table creation will fail.
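
In the meantime, a possible workaround (untested sketch; it assumes the 
OpenCSVSerde jar is available on the classpath) is to declare the table with 
Hive syntax instead of USING csv, so that a real serde is recorded in the 
metastore:

{code:sql}
-- Hypothetical workaround: Hive-format DDL instead of a Spark datasource table.
CREATE EXTERNAL TABLE test_csv_hive (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://[...]';
{code}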

  was:
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive 
metadata, breaking compatibility with other query engines like Hive and Presto. 
Here is a very simple example:

 

{code:sql}
CREATE TABLE test_csv (id String, name String)
USING csv
LOCATION 's3://[...]';
{code}

If you describe the table using Presto, you will see:

The table location is set to a placeholder value, and the schema is always set 
to {{`col` array<string>}}. The serde/input format is also wrong: it says 
sequence file even though the requested format is CSV.

The correct metadata is written to the custom table properties with the 
`spark.sql` prefix, but Hive and Presto do not understand these properties, 
which breaks those engines.

I could reproduce this with JSON too, but not with Parquet. I root-caused the 
issue to CSV and JSON tables not being handled 
[here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66]
in HiveSerDe.scala.

Is there a reason why they are not handled? I could send a patch to fix this, 
but the caveat is that the CSV and JSON Hive serdes would need to be on the 
Spark classpath; otherwise, table creation will fail.


> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
>                 Key: SPARK-31799
>                 URL: https://issues.apache.org/jira/browse/SPARK-31799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Anoop Johnson
>            Priority: Major
>
> If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive 
> metadata, breaking compatibility with other query engines like Hive and 
> Presto. Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id String, name String)
> USING csv
> LOCATION 's3://[...]';
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE TABLE `test_csv` (`id` STRING, `name` STRING)
> USING csv
> OPTIONS (
>   `serialization.format` '1',
>   path 's3://anoopj-emr-dev-us-east-1/csvtest_ctas'
> )
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array<string> COMMENT 'from deserializer')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
> WITH SERDEPROPERTIES ( 
>   'path'='s3://[...]') 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.mapred.SequenceFileInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4', 
>   'spark.sql.sources.provider'='csv', 
>   'spark.sql.sources.schema.numParts'='1', 
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>   'transient_lastDdlTime'='1590196086');
> {code}
> The table location is set to a placeholder value, and the schema is always 
> set to {{`col` array<string>}}. The serde/input format is also wrong: it says 
> sequence file even though the requested format is CSV.
> The correct metadata is written to the custom table properties with the 
> `spark.sql` prefix, but Hive and Presto do not understand these properties, 
> which breaks those engines.
> I could reproduce this with JSON too, but not with Parquet. I root-caused 
> the issue to CSV and JSON tables not being handled 
> [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66]
> in HiveSerDe.scala.
> Is there a reason why they are not handled? I could send a patch to fix 
> this, but the caveat is that the CSV and JSON Hive serdes would need to be on 
> the Spark classpath; otherwise, table creation will fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
