[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345443#comment-15345443 ]

Yin Huai commented on SPARK-16032:
----------------------------------

Let's look at two examples (tests were done using the 2.0 branch before the changes 
in this JIRA). We have two tables with the same schema and the same list of 
partition columns, but they are stored in two formats (Hive's text format and 
Spark's Parquet). I am using exactly the same commands to insert three rows into 
each of the two tables.
{code}
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("DROP TABLE IF EXISTS hive_src")
spark.sql("CREATE TABLE hive_src (a INT, d INT) PARTITIONED BY (b INT, c INT) STORED AS TEXTFILE")
spark.sql("DROP TABLE IF EXISTS spark_src")
spark.sql("CREATE TABLE spark_src (a INT, b INT, c INT, d INT) USING PARQUET PARTITIONED BY (b, c)")

Seq((1, 2, 3, 4)).toDF("b", "c", "d", "a").write.mode("append").insertInto("hive_src")
Seq((5, 6, 7, 8)).toDF("c", "b", "d", "a").write.mode("append").insertInto("hive_src")
Seq((9, 10, 11, 12)).toDF("c", "b", "d", "a").write.partitionBy("b", "c").mode("append").insertInto("hive_src")
spark.table("hive_src").show

+---+---+---+---+
|  a|  d|  b|  c|
+---+---+---+---+
|  3|  4|  1|  2|
|  7|  8|  6|  5|
| 11| 12|  9| 10|
+---+---+---+---+

Seq((1, 2, 3, 4)).toDF("b", "c", "d", "a").write.mode("append").insertInto("spark_src")
Seq((5, 6, 7, 8)).toDF("c", "b", "d", "a").write.mode("append").insertInto("spark_src")
Seq((9, 10, 11, 12)).toDF("c", "b", "d", "a").write.partitionBy("b", "c").mode("append").insertInto("spark_src")
spark.table("spark_src").show

+---+---+---+---+
|  a|  d|  b|  c|
+---+---+---+---+
|  5|  6|  7|  8|
|  1|  2|  3|  4|
| 11| 12|  9| 10|
+---+---+---+---+
{code}

You can see that their results are different. For the Hive SerDe table, we adjust 
the partition columns based on column names; for the data source table, we do not. 
Also, I am not sure that applying by-name resolution only to partition columns is a 
good idea. It would confuse users, who might reasonably assume that by-name 
resolution is always used.

Also, even if partitionBy cannot be used together with insertInto, a user can 
still use select to explicitly adjust the ordering of columns: he/she can consult 
the table's schema and reorder the data's columns by name to match it. 
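
The select-based reordering described above can be sketched as follows. This is a minimal, hypothetical sketch in plain Scala (no Spark dependency); the helper `reorderByName` and the toy row representation are mine, not Spark APIs. It models why reordering input columns by name against the target table's column order makes a purely positional insert line up correctly:

```scala
// Hypothetical helper (not a Spark API): reorder a row's values so that
// a by-position insert matches the target table's column order.
def reorderByName(inputCols: Seq[String],
                  targetCols: Seq[String],
                  row: Seq[Int]): Seq[Int] = {
  val byName = inputCols.zip(row).toMap // column name -> value
  targetCols.map(byName)                // emit values in the table's order
}

// The third insert above writes Seq((9, 10, 11, 12)) with input columns
// ("c", "b", "d", "a") into a table whose schema order is (a, b, c, d):
val reordered = reorderByName(Seq("c", "b", "d", "a"),
                              Seq("a", "b", "c", "d"),
                              Seq(9, 10, 11, 12))
// reordered == Seq(12, 10, 9, 11), i.e. a=12, b=10, c=9, d=11
```

In Spark itself, the same effect can be obtained with something like `df.select(spark.table("spark_src").columns.map(col): _*)` before calling `insertInto` (`columns` and `select` are existing DataFrame APIs, but treat the exact snippet as a sketch rather than a recommendation from this ticket).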



> Audit semantics of various insertion operations related to partitioned tables
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16032
>                 URL: https://issues.apache.org/jira/browse/SPARK-16032
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Wenchen Fan
>            Priority: Critical
>         Attachments: [SPARK-16032] Spark SQL table insertion auditing - 
> Google Docs.pdf
>
>
> We found that semantics of various insertion operations related to partition 
> tables can be inconsistent. This is an umbrella ticket for all related 
> tickets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
