[jira] [Created] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names

Cheng Lian (JIRA) Fri, 05 Sep 2014 02:13:50 -0700

Cheng Lian created SPARK-3414:
---------------------------------

             Summary: Case insensitivity breaks when unresolved relation 
contains attributes with upper case letter in their names
                 Key: SPARK-3414
                 URL: https://issues.apache.org/jira/browse/SPARK-3414
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.0.2
            Reporter: Cheng Lian
            Priority: Critical



Paste the following snippet into {{spark-shell}} (Hive support is required) to 
reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during the analysis phase, the 
{{CaseInsensitiveAttributeReferences}} batch is only executed once.

When {{srdd}} is registered as the temporary table {{boom}}, its original 
(unanalyzed) logical plan is stored in the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Attributes referenced in the join operator are not lowercased yet.

Then, when {{select * from boom}} is being analyzed, the input logical plan 
is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation {{boom}} refers to the unanalyzed logical plan of 
{{srdd}}, which the {{ResolveRelations}} rule later substitutes in:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]                            Project [*]
! UnresolvedRelation None, boom, None    LowerCaseSchema
!                                         Subquery boom
!                                          Project ['name,'message]
!                                           Join Inner, Some(('rawLogs.filename = 'files.name))
!                                            LowerCaseSchema
!                                             Subquery rawlogs
!                                              SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
!                                            Subquery files
!                                             Project ['name]
!                                              LowerCaseSchema
!                                               Subquery logfiles
!                                                SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch runs before the 
{{Resolution}} batch and is not run again, the qualifier referenced in the join 
operator ({{rawLogs}}) is never lowercased once the stored plan has been 
substituted in, and this causes the resolution failure.

A reasonable fix would be to always register the analyzed logical plan in the 
catalog when registering temporary tables.
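
To make the ordering problem and the proposed fix concrete, here is a toy model that does not use Spark at all. Every type and function in it ({{Plan}}, {{Filter}}, {{ToyAnalyzer}}, {{Demo}}) is hypothetical and only mimics the behaviour described above: a lowercasing pass that runs once, followed by a relation-resolution pass that splices catalog entries into the tree. It is a sketch of the idea, not Catalyst's actual implementation.

```scala
// Toy model only: none of these types are real Catalyst classes.
sealed trait Plan
case class UnresolvedRelation(name: String) extends Plan
case class Relation(name: String, columns: Seq[String]) extends Plan
// Stands in for the join condition: a qualified reference (qualifier.column) over a child.
case class Filter(qualifier: String, column: String, child: Plan) extends Plan

object ToyAnalyzer {
  type Catalog = Map[String, Plan]

  // "Batch 1", run once: lowercase every identifier currently in the tree.
  def lowercase(plan: Plan): Plan = plan match {
    case UnresolvedRelation(n) => UnresolvedRelation(n.toLowerCase)
    case Relation(n, cols)     => Relation(n.toLowerCase, cols.map(_.toLowerCase))
    case Filter(q, c, child)   => Filter(q.toLowerCase, c.toLowerCase, lowercase(child))
  }

  // "Batch 2": replace unresolved relations with whatever the catalog stores.
  // Table names are looked up in lowercase; spliced-in plans are NOT lowercased again.
  def resolveRelations(plan: Plan, catalog: Catalog): Plan = plan match {
    case UnresolvedRelation(n) => resolveRelations(catalog(n.toLowerCase), catalog)
    case Filter(q, c, child)   => Filter(q, c, resolveRelations(child, catalog))
    case r: Relation           => r
  }

  def analyze(plan: Plan, catalog: Catalog): Plan =
    resolveRelations(lowercase(plan), catalog)

  // Qualified references that do not match (case-sensitively) the relation directly below them.
  def unresolved(plan: Plan): Seq[String] = plan match {
    case Filter(q, c, child) =>
      val ok = child match {
        case Relation(n, cols) => n == q && cols.contains(c)
        case _                 => false
      }
      (if (ok) Nil else Seq(s"$q.$c")) ++ unresolved(child)
    case _ => Nil
  }
}

object Demo {
  import ToyAnalyzer._

  val base: Catalog = Map("rawlogs" -> Relation("rawlogs", Seq("filename", "message")))

  // The plan behind srdd, still containing the mixed-case qualifier.
  val srdd: Plan = Filter("rawLogs", "filename", UnresolvedRelation("rawLogs"))

  // Current behaviour in this model: the unanalyzed plan is stored under "boom".
  val broken: Plan = analyze(UnresolvedRelation("boom"), base + ("boom" -> srdd))

  // Proposed fix in this model: store the analyzed plan instead.
  val fixed: Plan = analyze(UnresolvedRelation("boom"), base + ("boom" -> analyze(srdd, base)))

  def main(args: Array[String]): Unit = {
    println(unresolved(broken)) // one reference left unresolved: rawLogs.filename
    println(unresolved(fixed))  // nothing left unresolved
  }
}
```

In this model the mixed-case qualifier survives only when the unanalyzed plan is stored, because the lowercasing pass has already finished by the time the stored plan is spliced in. Storing the analyzed plan means the qualifier has been lowercased before it ever reaches the catalog.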



