Hi Jeff,

Thanks for your interest. When this project started (August 2014), the DataFrame API was not yet available, which is why we don't use it in the project. An engineer at InMobi raised a similar idea before. In my view, if the DataFrame API is more suitable than the RDD API, we can consider it as part of later optimization work after the first release. For now, you can file a subtask under PIG-4856 (an umbrella JIRA for optimization work) and work on it if you are interested.

Best Regards
Kelly Zhang/Zhang,Liyun

-----Original Message-----
From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Sunday, January 8, 2017 10:13 AM
To: dev@pig.apache.org
Subject: Why pig on spark use RDD API rather than DataFrame API ?

Hi Folks,

I am very interested in the Pig on Spark project. Reading the code, I found that the current implementation is based on the Spark RDD API. I don't know the original background (maybe the DataFrame API was not available when the project started), but for now I feel the DataFrame API might be more suitable than the RDD API. Here are two advantages of the DataFrame API I can think of:

1. The DataFrame API is easier to use than the RDD API. Although it is less flexible than the RDD API, Pig's tuple data structure is very similar to that of a DataFrame, so I think it should be possible to map each Pig operation to a DataFrame operation. If not, we can give feedback to the Spark community.
2. Spark's Catalyst provides many optimizations for DataFrames. If we use the DataFrame API, we can leverage those optimizations rather than reinvent the wheel in Pig.

What do you think?

Thanks