Thanks, guys. Let's take this up in the next release. I hope the first release will be coming soon.

Pallavi Rao <pallavi....@inmobi.com> wrote on Mon, Jan 9, 2017 at 1:03 PM:

> Yes. That was the first question I asked when I started work on Pig on
> Spark. After investigating a little more, I realized that the current
> design does not allow for easy use of the DataFrame API. We do an
> operator-by-operator substitution and use Tuple as the datatype. We would
> end up converting RDDs to DataFrames and vice versa, which is not really
> optimal.
>
> So, as Kelly said, we should take up that optimization after the first
> release. And we would even move to the Dataset API then.
>
> On Mon, Jan 9, 2017 at 7:53 AM, Zhang, Liyun <liyun.zh...@intel.com>
> wrote:
>
> > Hi Jeff:
> > Thanks for your interest. When this project started (August 2014), the
> > DataFrame API was not available, which is why we don't use it in the
> > project. An engineer at InMobi raised a similar idea before. In my view,
> > if the DataFrame API is more suitable than the RDD API, we can consider
> > this in later optimization work after the first release. For now, you
> > can file a subtask on PIG-4856 (an umbrella JIRA for optimization work)
> > and work on it if you are interested.
> >
> > Best Regards
> > Kelly Zhang/Zhang,Liyun
> >
> > -----Original Message-----
> > From: Jeff Zhang [mailto:zjf...@gmail.com]
> > Sent: Sunday, January 8, 2017 10:13 AM
> > To: dev@pig.apache.org
> > Subject: Why pig on spark use RDD API rather than DataFrame API ?
> >
> > Hi Folks,
> >
> > I am very interested in the Pig on Spark project. When I read the code,
> > I found that the current implementation is based on the Spark RDD API.
> > I don't know the original background (maybe the DataFrame API was not
> > available when this project started), but for now I feel the DataFrame
> > API might be more suitable than the RDD API. Here are two advantages of
> > the DataFrame API I can think of:
> > 1. The DataFrame API is easier to use than the RDD API. Although it is
> > less flexible than the RDD API, I think Pig's tuple data structure is
> > very similar to that of a DataFrame. I think it should be possible to
> > map each Pig operation to a DataFrame operation. If not, we can give
> > feedback to the Spark community.
> > 2. Spark's Catalyst provides many optimizations for DataFrames. If we
> > use the DataFrame API, we can leverage those optimizations rather than
> > reinvent the wheel in Pig.
> >
> > What do you think? Thanks