Re: Why pig on spark use RDD API rather than DataFrame API ?

Jeff Zhang Mon, 09 Jan 2017 22:32:14 -0800

Thanks guys, let's do it in the next release. I hope the first release
comes out soon.




Pallavi Rao <pallavi....@inmobi.com> wrote on Mon, Jan 9, 2017 at 1:03 PM:

> Yes. That was the first question I asked when I started work on Pig on
> Spark. After investigating a little more, I realized that the current
> design does not allow for easy use of the DataFrame API. We do an
> operator-by-operator substitution and use Tuple as the data type. We would
> end up converting RDDs to DataFrames and vice versa, which is not really
> optimal.
>
> So, as Kelly said, we should take up that optimization after the first
> release. And we could even move to the Dataset API then.
>
> On Mon, Jan 9, 2017 at 7:53 AM, Zhang, Liyun <liyun.zh...@intel.com>
> wrote:
>
> > Hi Jeff:
> >   Thanks for your interest. When this project started (August 2014), the
> > DataFrame API was not available, which is why we don't use it in the
> > project. An engineer at InMobi raised a similar idea before. In my view,
> > if the DataFrame API is more suitable than the RDD API, we can consider
> > this in later optimization work after the first release. For now, you can
> > file a subtask under PIG-4856 (an umbrella JIRA for optimization work) and
> > work on it if you are interested.
> >
> >
> >
> > Best Regards
> > Kelly Zhang/Zhang,Liyun
> >
> >
> >
> > -----Original Message-----
> > From: Jeff Zhang [mailto:zjf...@gmail.com]
> > Sent: Sunday, January 8, 2017 10:13 AM
> > To: dev@pig.apache.org
> > Subject: Why pig on spark use RDD API rather than DataFrame API ?
> >
> > Hi Folks,
> >
> > I am very interested in the Pig on Spark project. When I read the code,
> > I found that the current implementation is based on the Spark RDD API. I
> > don't know the original background (maybe the DataFrame API was not
> > available when this project started), but for now I feel the DataFrame
> > API might be more suitable than the RDD API. Here are 2 advantages of the
> > DataFrame API I can think of:
> > 1.  The DataFrame API is easier to use than the RDD API. Although it is
> > less flexible than RDD, I think Pig's tuple data structure is very
> > similar to that of DataFrame, so it should be possible to map each Pig
> > operation to a DataFrame operation. If not, we can give feedback to the
> > Spark community.
> > 2.  Spark's Catalyst provides lots of optimizations for DataFrames. If we
> > use the DataFrame API, we can leverage those optimizations rather than
> > reinvent the wheel in Pig.
> >
> > What do you think? Thanks
> >
>
>
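
To make the trade-off discussed in the quoted thread a bit more concrete, here is
a rough toy sketch in plain Python. It is not Spark or Pig code: Project, Filter,
push_filters_down, execute and run_opaque are all hypothetical names invented for
this illustration. It only shows the general idea that steps written as opaque
per-tuple functions (RDD style) must run in the order written, while steps
described as data (DataFrame style) can be inspected and reordered, for example
by a simple predicate pushdown, which is one kind of rewrite a query optimizer
can apply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# --- RDD-style: every step is an opaque function over positional tuples ---
def run_opaque(rows: List[tuple], steps: List[Callable]) -> List[tuple]:
    """Run the steps exactly in the order given; the engine cannot see inside them."""
    for step in steps:
        rows = step(rows)
    return rows

# --- DataFrame-style: every step is a small, inspectable description ---
@dataclass(frozen=True)
class Project:          # hypothetical operator: keep only these named columns
    columns: Tuple[str, ...]

@dataclass(frozen=True)
class Filter:           # hypothetical operator: keep rows with column >= min_value
    column: str
    min_value: int

Op = Union[Project, Filter]

def push_filters_down(plan: List[Op]) -> List[Op]:
    """Move each Filter ahead of a preceding Project when the projection keeps
    the filter column, so fewer rows reach the projection."""
    ops = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(ops) - 1):
            first, second = ops[i], ops[i + 1]
            if (isinstance(first, Project) and isinstance(second, Filter)
                    and second.column in first.columns):
                ops[i], ops[i + 1] = second, first
                changed = True
    return ops

def execute(plan: List[Op], rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Interpret the plan step by step over rows with named columns."""
    for op in plan:
        if isinstance(op, Filter):
            rows = [r for r in rows if r[op.column] >= op.min_value]
        else:
            rows = [{c: r[c] for c in op.columns} for r in rows]
    return rows

# Made-up sample data, once as schema-less tuples and once as named rows.
tuples = [("a", 25, "x"), ("b", 35, "y"), ("c", 41, "z")]
rows = [{"name": n, "age": a, "city": c} for (n, a, c) in tuples]

opaque_steps = [
    lambda rs: [(r[0], r[1]) for r in rs],     # "project" by position
    lambda rs: [r for r in rs if r[1] >= 30],  # "filter" by position
]
opaque_result = run_opaque(tuples, opaque_steps)

plan = [Project(("name", "age")), Filter("age", 30)]
optimized = push_filters_down(plan)            # Filter now runs first
plan_result = execute(plan, rows)
optimized_result = execute(optimized, rows)
```

Both plans give the same rows here; the only difference is that the rewritten
plan filters before projecting. With opaque functions over tuples there is
nothing for such a rewrite to inspect, which is the gist of the Catalyst
argument above.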
