RE: Why pig on spark use RDD API rather than DataFrame API ?

Zhang, Liyun Sun, 08 Jan 2017 18:26:36 -0800

Hi Jeff:
  Thanks for your interest. When this project started (August 2014), the 
DataFrame API was not yet available, which is why we do not use it in the 
project. An engineer at InMobi raised a similar idea before. In my view, if 
the DataFrame API is more suitable than the RDD API, we can consider it as 
later optimization work after the first release. For now, you can file a 
subtask under PIG-4856 (an umbrella JIRA for optimization work) and work on 
it if you are interested.




Best Regards
Kelly Zhang/Zhang,Liyun



-----Original Message-----
From: Jeff Zhang [mailto:zjf...@gmail.com] 
Sent: Sunday, January 8, 2017 10:13 AM
To: dev@pig.apache.org
Subject: Why pig on spark use RDD API rather than DataFrame API ?

Hi Folks,

I am very interested in the Pig on Spark project. Reading the code, I found 
that the current implementation is based on the Spark RDD API. I don't know 
the original background (maybe the DataFrame API was not available when the 
project started), but I now feel the DataFrame API might be more suitable 
than the RDD API. Here are two advantages of the DataFrame API I can think of:
1.  The DataFrame API is easier to use than the RDD API. Although it is less 
flexible than RDD, I think Pig's tuple data structure is very similar to that 
of DataFrame, so it should be possible to map each Pig operation to a 
DataFrame operation. If not, we can give feedback to the Spark community.
2.  Spark's Catalyst optimizer provides many optimizations for DataFrame. If 
we use the DataFrame API, we can leverage those optimizations rather than 
reinventing the wheel in Pig.
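
To make point 2 concrete, here is a toy sketch in plain Python. It is not Spark code: Scan, Project, Filter, optimize and execute are hypothetical names made up for illustration. It only shows why a declarative plan (as a DataFrame-style API builds) can be rewritten by an optimizer, while opaque functions passed to RDD-style map/filter calls cannot. The rewrite rule is in the spirit of predicate pushdown, applied to a plan like "B = FOREACH A GENERATE name, age; C = FILTER B BY age > 30;".

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# Toy logical-plan nodes. These are illustrative only, not Spark or Pig classes.
@dataclass
class Scan:
    rows: Tuple[Row, ...]

@dataclass
class Project:
    child: Any
    columns: Tuple[str, ...]

@dataclass
class Filter:
    child: Any
    column: str
    min_value: int  # keeps rows where row[column] > min_value

def optimize(plan: Any) -> Any:
    """Push a Filter below a Project when the filtered column survives the projection."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = optimize(Filter(child.child, plan.column, plan.min_value))
            return Project(pushed, child.columns)
        return Filter(child, plan.column, plan.min_value)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.columns)
    return plan

def execute(plan: Any) -> List[Row]:
    """Naive interpreter for the toy plan."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if r[plan.column] > plan.min_value]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

rows = (
    {"name": "alice", "age": 25, "city": "x"},
    {"name": "bob", "age": 41, "city": "y"},
)
# FOREACH ... GENERATE name, age, followed by FILTER ... BY age > 30
plan = Filter(Project(Scan(rows), ("name", "age")), "age", 30)
optimized = optimize(plan)  # the filter now runs before the projection
result = execute(optimized)
```

Because the filter is plain data (a column name and a threshold) rather than an arbitrary closure, the optimizer can inspect it and reorder the plan without changing the result. Whether every Pig operator maps this cleanly onto DataFrame operations is the open question raised in point 1.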

What do you think? Thanks.
