[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:46 AM:
-----------------------------------------------------------------

[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible (see 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm). If we 
take that approach with Spark, then:


- The first case would be OK (though different from Hive, which will cause its 
own set of problems: since there is essentially no documentation on Spark SQL, 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns was 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.

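For reference, the Oracle-style rule discussed above (same column count, compatible types, names irrelevant) can be sketched as a plain schema check. This is a hypothetical illustration, not Spark's actual analyzer code; the `Schema` representation as (name, type) pairs is an assumption made for the sketch:

```scala
// Hypothetical sketch of an Oracle-style UNION ALL compatibility check.
// A schema is modeled as an ordered list of (name, type) pairs; names may
// differ between branches, but arity and types must match.
object UnionCheck {
  type Schema = Seq[(String, String)]

  def compatible(left: Schema, right: Schema): Either[String, Unit] =
    if (left.size != right.size)
      Left(s"UNION ALL branches have ${left.size} and ${right.size} columns")
    else {
      val mismatch = left.zip(right).zipWithIndex.collectFirst {
        case (((_, lt), (_, rt)), i) if lt != rt =>
          s"column $i has incompatible types: $lt vs $rt"
      }
      mismatch.toLeft(())
    }
}
```

Under this rule the first repro below (category vs. cat, both strings) would pass, while dropping a column or mixing a numeric into a string column would fail with a descriptive message instead of an opaque exception.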


> Incorrect UNION ALL behavior
> ----------------------------
>
>                 Key: SPARK-9813
>                 URL: https://issues.apache.org/jira/browse/SPARK-9813
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: Simeon Simeonov
>              Labels: sql, union
>
> According to the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
> for UNION ALL:
> {quote}
> The number and names of columns returned by each select_statement have to be 
> the same. Otherwise, a schema error is thrown.
> {quote}
> Spark SQL silently swallows this error when the tables combined with UNION 
> ALL have the same number of columns but different names.
> Reproducible example:
> {code}
> // This test is meant to run in spark-shell
> import java.io.File
> import java.io.PrintWriter
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.SaveMode
> val ctx = sqlContext.asInstanceOf[HiveContext]
> import ctx.implicits._
> def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
> def tempTable(name: String, json: String) = {
>   val path = dataPath(name)
>   new PrintWriter(path) { write(json); close() }
>   ctx.read.json("file://" + path).registerTempTable(name)
> }
> // Note category vs. cat names of first column
> tempTable("test_one", """{"category" : "A", "num" : 5}""")
> tempTable("test_another", """{"cat" : "A", "num" : 5}""")
> //  +--------+---+
> //  |category|num|
> //  +--------+---+
> //  |       A|  5|
> //  |       A|  5|
> //  +--------+---+
> //
> //  Instead, an error should have been generated due to incompatible schema
> ctx.sql("select * from test_one union all select * from test_another").show
> // Cleanup
> new File(dataPath("test_one")).delete()
> new File(dataPath("test_another")).delete()
> {code}
> When the number of columns differs, Spark can even mix datatypes. 
> Reproducible example (requires a new spark-shell session):
> {code}
> // This test is meant to run in spark-shell
> import java.io.File
> import java.io.PrintWriter
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.SaveMode
> val ctx = sqlContext.asInstanceOf[HiveContext]
> import ctx.implicits._
> def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
> def tempTable(name: String, json: String) = {
>   val path = dataPath(name)
>   new PrintWriter(path) { write(json); close() }
>   ctx.read.json("file://" + path).registerTempTable(name)
> }
> // Note test_another is missing category column
> tempTable("test_one", """{"category" : "A", "num" : 5}""")
> tempTable("test_another", """{"num" : 5}""")
> //  +--------+
> //  |category|
> //  +--------+
> //  |       A|
> //  |       5| 
> //  +--------+
> //
> //  Instead, an error should have been generated due to incompatible schema
> ctx.sql("select * from test_one union all select * from test_another").show
> // Cleanup
> new File(dataPath("test_one")).delete()
> new File(dataPath("test_another")).delete()
> {code}
> At other times, when the schemas are complex, Spark SQL produces a misleading 
> error about an unresolved Union operator:
> {code}
> scala> ctx.sql("""select * from view_clicks
>      | union all
>      | select * from view_clicks_aug
>      | """)
> 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
> union all
> select * from view_clicks_aug
> 15/08/11 02:40:25 INFO ParseDriver: Parse Completed
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=view_clicks
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu      ip=unknown-ip-addr      
> cmd=get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=view_clicks
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu      ip=unknown-ip-addr      
> cmd=get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu      ip=unknown-ip-addr      
> cmd=get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu      ip=unknown-ip-addr      
> cmd=get_table : db=default tbl=view_clicks_aug
> org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
>       at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
>       at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
>       at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>       at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
