Hi Darin,

I agree that there are pros and cons to working Cascading into Pirk.

Cascading comes with a good number of taps, which will be valuable for
reading data from varied sources. However, Cascading's current 'processing'
core is focused on Hadoop MapReduce, and MR has performed far worse for Pirk
than other platforms such as Spark (I am going to quantify and qualify this
soon -- see PIRK-31). I see that they are incorporating processing
frameworks such as Flink, but this is new (i.e. not yet mature).

That said, it can't hurt to hook Cascading into Pirk -- it just might not
turn out to be a big performance win.

I'm up for trying it out, performing some benchmark comparisons, and seeing
where we end up.

Thanks!

Ellison Anne

On Fri, Jul 29, 2016 at 8:20 AM, Chris Harris <[email protected]> wrote:

> I have mixed feelings on it. I like Cascading, so my initial reaction is
> to want to do it. However, we already have MapReduce code, and I know we'd
> pick up Flink and Tez with Cascading, but I'd rather pick up Flink via Beam
> instead (and I wouldn't be surprised if there's eventually a Tez DataRunner
> for Beam). I'd like to see how a Cascading prototype of Pirk compares
> against our code in MR. If their optimizations help out a lot, it would be
> a nice win.
>
>
> On Thu, Jul 28, 2016 at 10:36 PM, Darin Johnson <[email protected]>
> wrote:
>
> > Cascading is a higher level API for Hadoop-mapreduce, Tez and Flink. The
> > Pirk roadmap mentions support for a number of other frameworks (Flink and
> > Storm being two); this would take care of Flink and add Tez support as
> > well.
> >
> > If there's interest I'll add a JIRA and link other issues accordingly.
> >
> > I don't think there will be any license issues as:
> >
> > 1. Cascading is Apache Licensed.
> > 2. Elastic Search dependencies are pulling in the dependencies
> >    already, and RAT passes.
> >
> > There are good reasons not to go with this approach as well. Including:
> >
> > 1. Cascading is not an Apache Project - it's pretty much only
> >    Concurrent calling the shots.
> > 2. Usually Cascading is pretty good about optimizing Map/Reduce jobs,
> >    however Tez and Flink extensions are new so I'm uncertain about the
> >    performance hit vs native implementations.
> >
> > These may be blockers for inclusion in the project or making it part of a
> > contrib section. Thought I'd open it up for discussion.
> >
> > Darin
> >
>
