Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, I think that people have reported the same issue elsewhere, and this should be registered as a bug in Spark: https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html Regards, Gourav On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta wrote: > Hi

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, I have a table that is read directly from an S3 location, and even a self join on that cached table causes the data to be read from S3 again. The query plan is shown below: == Parsed Logical Plan == Aggregate [count(1) AS count#1804L] Project [user#0,programme_key#515] Join Inner,

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, the attached DAG shows that for the same table (self join) Spark is unnecessarily fetching data from S3 for one side of the join, whereas it is able to use the cache for the other side. Regards, Gourav On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta wrote: > Hi, >

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Ted Yu
The picture is a bit hard to read. I did a brief search but haven't found a JIRA for this issue. Consider logging a SPARK JIRA. Cheers On Fri, Dec 18, 2015 at 4:37 AM, Gourav Sengupta wrote: > Hi, > > the attached DAG shows that for the same table (self join) SPARK

Re: HiveContext Self join not reading from cache

2015-12-17 Thread Gourav Sengupta
Hi Ted, The self join works fine on tables where the HiveContext tables are direct Hive tables, for example: table1 = hiveContext.sql("select columnA, columnB from hivetable1") table1.registerTempTable("table1") table1.cache() table1.count() and if I do a self join on table1 things are quite fine

Re: HiveContext Self join not reading from cache

2015-12-16 Thread Ted Yu
I did the following exercise in spark-shell ("c" is cached table): scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain == Physical Plan == Project [b#4] +- BroadcastHashJoin [a#3], [a#125], BuildRight :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation
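A plan like the one above can also be checked programmatically by counting the in-memory scan nodes in its text. The following is only a minimal sketch, assuming the plan has already been captured as a string (for example, pasted from the console). The function name `count_cache_scans` is hypothetical, the node name `InMemoryColumnarTableScan` is taken from the output above, and `InMemoryTableScan` is included on the assumption that newer Spark versions print that name instead.

```python
import re

# Node names that indicate a scan of cached data. The first is the name shown
# in the plan above; the second is an assumption for newer Spark versions.
CACHE_SCAN_NODES = ("InMemoryColumnarTableScan", "InMemoryTableScan")


def count_cache_scans(plan_text):
    """Count cached-scan nodes in the text of a physical plan.

    Hypothetical helper: it only inspects the plan text and does not
    talk to Spark at all.
    """
    pattern = r"\b(?:" + "|".join(CACHE_SCAN_NODES) + r")\b"
    return len(re.findall(pattern, plan_text))
```

For a self join that reads both sides from the cache, the count should be 2. A count of 1 would match the behaviour reported in this thread, where one side of the join goes back to the source.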

HiveContext Self join not reading from cache

2015-12-16 Thread Gourav Sengupta
Hi, This is how the data can be created: 1. TableA : cached() 2. TableB : cached() 3. TableC: TableA inner join TableB cached() 4. TableC join TableC does not take the data from cache but starts reading the data for TableA and TableB from disk. Does this sound like a bug? The self join between
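The four steps above could be sketched in PySpark roughly as follows. This is an illustration under stated assumptions, not the original poster's code: the function name, the parameters, and the temp table name `table_c` are hypothetical, `hive_context` is assumed to be an existing HiveContext, and the two input tables are assumed to share only the join key column.

```python
def build_cached_self_join(hive_context, table_a, table_b, key):
    """Reproduce steps 1-4 and return the self-join DataFrame of TableC.

    All names here are hypothetical; `hive_context` is assumed to be an
    existing HiveContext and `key` a column present in both tables.
    """
    # Steps 1 and 2: load and cache both inputs; count() forces materialization.
    a = hive_context.table(table_a).cache()
    b = hive_context.table(table_b).cache()
    a.count()
    b.count()

    # Step 3: TableC = TableA inner join TableB, cached and registered.
    c = a.join(b, key).cache()
    c.registerTempTable("table_c")
    c.count()

    # Step 4: self join of TableC, the query reported to re-read the inputs.
    return hive_context.sql(
        "select count(*) from table_c x join table_c y "
        "on x.{0} = y.{0}".format(key)
    )
```

Calling `explain()` on the returned DataFrame prints the plan, which shows whether both sides of the join scan the cached data.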