Cheng Lian created SPARK-3414:
---------------------------------
Summary: Case insensitivity breaks when an unresolved relation
contains attributes with upper case letters in their names
Key: SPARK-3414
URL: https://issues.apache.org/jira/browse/SPARK-3414
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Critical
Paste the following snippet into {{spark-shell}} (Hive support is required) to
reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

// Register two empty temporary tables whose names contain upper case letters.
sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

// The join condition refers to rawLogs by its mixed-case name.
val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)

// Register the query result as a temporary table, then query it.
srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.
The reason is that, during the analysis phase, the
{{CaseInsensitiveAttributeReferences}} batch is only executed once.
When {{srdd}} is registered as the temporary table {{boom}}, its original
(unanalyzed) logical plan is stored in the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Note that the attributes referenced in the join operator are not lowercased yet.
Then, when {{select * from boom}} is being analyzed, the input logical plan
is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}},
which is later substituted in by the {{ResolveRelations}} rule:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                            Project [*]
! UnresolvedRelation None, boom, None    LowerCaseSchema
!                                         Subquery boom
!                                          Project ['name,'message]
!                                           Join Inner, Some(('rawLogs.filename = 'files.name))
!                                            LowerCaseSchema
!                                             Subquery rawlogs
!                                              SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                            Subquery files
!                                             Project ['name]
!                                              LowerCaseSchema
!                                               Subquery logfiles
!                                                SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the
{{Resolution}} batch, the attribute referenced in the join operator
({{rawLogs}}) is never lowercased, which causes the resolution failure.
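The ordering problem can be illustrated with a small toy model. This is only a sketch: the {{Plan}} types, {{lowercase}}, {{resolve}} and {{isResolvable}} below are hypothetical stand-ins invented for illustration, not Catalyst classes or rules. It assumes a one-shot lowercasing pass followed by a pass that inlines whatever plan the catalog stores.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
object BatchOrderSketch {
  sealed trait Plan
  case class Unresolved(name: String) extends Plan                  // like an unresolved relation
  case class Base(name: String) extends Plan                        // a resolved base table
  case class Qualified(qualifier: String, child: Plan) extends Plan // operator using qualifier.attr

  // Batch 1, run once: lowercase every name present in the tree at that moment.
  def lowercase(plan: Plan): Plan = plan match {
    case Unresolved(n)   => Unresolved(n.toLowerCase)
    case Qualified(q, c) => Qualified(q.toLowerCase, lowercase(c))
    case b: Base         => b
  }

  // Batch 2: inline the plan stored in the catalog for each unresolved relation.
  // Nothing lowercases the inlined plan afterwards.
  def resolve(catalog: Map[String, Plan])(plan: Plan): Plan = plan match {
    case Unresolved(n)   => resolve(catalog)(catalog(n.toLowerCase))
    case Qualified(q, c) => Qualified(q, resolve(catalog)(c))
    case b: Base         => b
  }

  def analyze(catalog: Map[String, Plan])(plan: Plan): Plan =
    resolve(catalog)(lowercase(plan))

  // In this model a qualifier resolves only if it matches the table name exactly.
  def isResolvable(plan: Plan): Boolean = plan match {
    case Qualified(q, Base(n)) => q == n
    case Base(_)               => true
    case _                     => false
  }

  // "boom" maps to the unanalyzed plan, mirroring the behavior described above.
  val srddPlan: Plan = Qualified("rawLogs", Unresolved("rawLogs"))
  val catalog: Map[String, Plan] =
    Map("rawlogs" -> Base("rawlogs"), "boom" -> srddPlan)
}
```

In this model, {{analyze(catalog)(srddPlan)}} produces a resolvable plan because the qualifier is lowercased before resolution, whereas {{analyze(catalog)(Unresolved("boom"))}} inlines the stored plan after the lowercasing pass has already run, leaving the qualifier {{rawLogs}} over the table {{rawlogs}}.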
A reasonable fix for this could be to always register the analyzed logical plan
in the catalog when registering temporary tables.
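The effect of such a change can be sketched in a toy catalog. Everything here is an assumption made for illustration: {{ToyCatalog}}, its {{storeAnalyzed}} flag and the {{Plan}} types are hypothetical and do not correspond to Spark's catalog API or to any actual patch.

```scala
// Toy model only: hypothetical types, not Catalyst classes or Spark's catalog API.
object AnalyzedRegistrationSketch {
  sealed trait Plan
  case class Unresolved(name: String) extends Plan
  case class Base(name: String) extends Plan
  case class Qualified(qualifier: String, child: Plan) extends Plan

  // storeAnalyzed = false mimics the reported behavior; true mimics the proposed fix.
  class ToyCatalog(storeAnalyzed: Boolean) {
    private var tables = Map.empty[String, Plan]

    def registerBase(name: String): Unit =
      tables += (name.toLowerCase -> Base(name.toLowerCase))

    def registerTempTable(name: String, plan: Plan): Unit =
      tables += (name.toLowerCase -> (if (storeAnalyzed) analyze(plan) else plan))

    // One-shot lowercasing followed by relation resolution.
    def analyze(plan: Plan): Plan = resolve(lowercase(plan))

    private def lowercase(plan: Plan): Plan = plan match {
      case Unresolved(n)   => Unresolved(n.toLowerCase)
      case Qualified(q, c) => Qualified(q.toLowerCase, lowercase(c))
      case b: Base         => b
    }

    private def resolve(plan: Plan): Plan = plan match {
      case Unresolved(n)   => resolve(tables(n.toLowerCase))
      case Qualified(q, c) => Qualified(q, resolve(c))
      case b: Base         => b
    }
  }

  // Registers rawLogs and boom, then analyzes a query over boom.
  def boomPlan(storeAnalyzed: Boolean): Plan = {
    val catalog = new ToyCatalog(storeAnalyzed)
    catalog.registerBase("rawLogs")
    catalog.registerTempTable("boom", Qualified("rawLogs", Unresolved("rawLogs")))
    catalog.analyze(Unresolved("boom"))
  }
}
```

With {{storeAnalyzed = false}} the query over {{boom}} still carries the mixed-case qualifier {{rawLogs}}; with {{storeAnalyzed = true}} the stored plan was lowercased at registration time, so the qualifier and the table name agree.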