Hi Darin,

I agree that there are pros and cons to working Cascading into Pirk.

Cascading comes with a good number of taps which will be valuable in
reading data from varied sources.

However, Cascading's current 'processing' core is focused on
Hadoop-MapReduce, and MR has performed far worse for Pirk than other
platforms such as Spark (I am going to quantify and qualify this soon --
see PIRK-31). I see that they are incorporating processing frameworks
such as Flink, but this is new (i.e. not yet mature).

That said, it can't hurt to hook Cascading into Pirk -- it just might not
turn out to be a big performance win. I'm up for trying it out, performing
some benchmark comparisons, and seeing where we end up.

Thanks!

Ellison Anne





On Fri, Jul 29, 2016 at 8:20 AM, Chris Harris <[email protected]>
wrote:

> I have mixed feelings on it. I like Cascading, so my initial reaction is
> to want to do it.  However, we already have MapReduce code, and I know we'd
> pick up Flink and Tez with Cascading, but I'd rather pick up Flink via Beam
> instead (and I wouldn't be surprised if there's eventually a Tez DataRunner
> for Beam).  I'd like to see how a Cascading prototype of Pirk compares
> against our code in MR.  If their optimizations help out a lot, it would be
> a nice win.
>
>
> On Thu, Jul 28, 2016 at 10:36 PM, Darin Johnson <[email protected]>
> wrote:
>
> > Cascading is a higher level API for Hadoop-mapreduce, Tez and Flink.  The
> > Pirk roadmap mentions support for a number of other frameworks (Flink and
> > Storm being two); this would take care of Flink and add Tez support as
> > well.
> >
> > If there's interest I'll add a JIRA and link other issues accordingly.
> >
> > I don't think there will be any license issues as:
> >
> >
> >    1.   Cascading is Apache Licensed.
> >    2.   Elastic Search dependencies are pulling in the dependencies
> >    already, and RAT passes.
> >
> > There are good reasons not to go with this approach as well, including:
> >
> >    1. Cascading is not an Apache Project - it's pretty much only Concurrent
> >    calling the shots.
> >    2. Usually Cascading is pretty good about optimizing Map/Reduce jobs;
> >    however, Tez and Flink extensions are new so I'm uncertain about the
> >    performance hit vs native implementations.
> >
> > These may be blockers for inclusion in the project or making it part of a
> > contrib section.  Thought I'd open it up for discussion.
> >
> > Darin
> >
>
