[ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379499#comment-14379499
 ] 

Cheng Lian commented on SPARK-6450:
-----------------------------------

Here is a simpler Spark shell snippet for reproduction:
{noformat}
    import sqlContext._

    sql(
      """CREATE TABLE IF NOT EXISTS ms_convert (key INT)
        |STORED AS PARQUET
      """.stripMargin)

    // This shouldn't throw AnalysisException
    val analyzed = sql(
      """SELECT key FROM ms_convert
        |UNION ALL
        |SELECT key FROM ms_convert
      """.stripMargin).queryExecution.analyzed
{noformat}
[~marmbrus] has nailed down the root cause: the {{ParquetConversions}} analysis 
rule generates a hash map, which maps from the original {{MetastoreRelation}} 
instances to the newly created {{ParquetRelation2}} instances. However, 
{{MetastoreRelation.equals}} doesn't compare output attributes. Thus, if a 
single metastore Parquet table appears multiple times in a query, only a single 
entry ends up in the hash map, and the conversion is not correctly performed.
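
To illustrate the collapse described above, here is a minimal sketch that uses hypothetical stand-in classes ({{Attr}} and {{Relation}} are not Spark's real classes). It assumes that each occurrence of the table carries its own attribute IDs while equality looks only at the table name:

```scala
object CollapsedKeysDemo {
  // Hypothetical stand-in for an output attribute with a unique expression ID.
  final case class Attr(name: String, exprId: Long)

  // Hypothetical stand-in for a relation whose equality ignores `output`,
  // mimicking the behaviour described for MetastoreRelation.equals.
  final class Relation(val table: String, val output: Seq[Attr]) {
    override def equals(other: Any): Boolean = other match {
      case r: Relation => r.table == table
      case _           => false
    }
    override def hashCode: Int = table.hashCode
  }

  // Two occurrences of the same table, each with a different attribute ID.
  val first  = new Relation("ms_convert", Seq(Attr("key", 1L)))
  val second = new Relation("ms_convert", Seq(Attr("key", 2L)))

  // Keyed by relation only: the second entry overwrites the first.
  val conversions: Map[Relation, Seq[Attr]] =
    Seq(first, second).map(r => r -> r.output).toMap
}
```

With these definitions the map holds a single entry, so looking up {{first}} returns the attributes that belong to {{second}}.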

The proper fix for this issue is to override {{equals}} and {{hashCode}} in 
{{MetastoreRelation}}. However, doing so breaks more tests than expected. It's 
possible that these tests have been ill-formed from the very beginning. But as 
the 1.3.1 release is approaching, we'd like to keep the change surgical to 
avoid potential regressions. The proposed fix here is to use each metastore 
relation together with its output attributes as the key in the hash map used 
in {{ParquetConversions}}.
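
The idea of a composite key can be sketched as follows. This is only an illustration with hypothetical stand-in classes ({{Attr}}, {{Relation}}, and the string values are made up for the example), not the actual patch:

```scala
object CompositeKeyDemo {
  // Hypothetical stand-in for an output attribute with a unique expression ID.
  final case class Attr(name: String, exprId: Long)

  // Hypothetical stand-in for a relation whose equality still ignores `output`.
  final class Relation(val table: String, val output: Seq[Attr]) {
    override def equals(other: Any): Boolean = other match {
      case r: Relation => r.table == table
      case _           => false
    }
    override def hashCode: Int = table.hashCode
  }

  val first  = new Relation("ms_convert", Seq(Attr("key", 1L)))
  val second = new Relation("ms_convert", Seq(Attr("key", 2L)))

  // Keyed by (relation, output): the attribute lists differ, so both
  // occurrences keep their own entry even though the relations compare equal.
  val conversions: Map[(Relation, Seq[Attr]), String] =
    Seq(first, second)
      .map(r => (r, r.output) -> s"converted#${r.output.head.exprId}")
      .toMap
}
```

Because the key includes the output attributes, the relation class itself does not have to change, which keeps the fix local to the conversion rule.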

> Metastore Parquet table conversion fails when a single metastore Parquet 
> table appears multiple times in the query
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6450
>                 URL: https://issues.apache.org/jira/browse/SPARK-6450
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Anand Mohan Tumuluri
>            Assignee: Michael Armbrust
>            Priority: Blocker
>
> The query below worked fine up to 1.3 commit 
> 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes, it definitely works at this 
> commit, although this commit is completely unrelated.)
> It broke in the 1.3.0 release with an AnalysisException: resolved attributes 
> ... missing from .... (although this list contains the fields which it 
> reports missing)
> {code}
>       at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
>       at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>       at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>       at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>       at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>       at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>       at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>       at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
>       at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
>       at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>       at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
>       at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
>       at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>       at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>       at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>       at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
> (select Orders.Country, count(1) CountryOrderCount from Orders where 
> to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by 
> Orders.Country,Orders.ProductCategory;
> {code}
> The temporary workaround is to add an explicit alias for each occurrence of 
> the table Orders:
> {code}
> select o.Country, o.ProductCategory,count(1) from Orders o join (select 
> r.Country, count(1) CountryOrderCount from Orders r where 
> to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> o.Country where to_date(o.PlacedDate) > '2015-01-01' group by 
> o.Country,o.ProductCategory;
> {code}
> However, this change affects not only self joins but also union queries. 
> The query below also worked before (at commit 9a151ce) and is now broken:
> {code}
> select Orders.Country,null,count(1) OrderCount from Orders group by 
> Orders.Country,null
> union all
> select null,Orders.ProductCategory,count(1) OrderCount from Orders group by 
> null, Orders.ProductCategory
> {code}
> This query also fails with an AnalysisException.
> The workaround is to add different aliases for the tables.



