I was talking a few weeks ago with Chris Wensel of Cascading fame.  The
topic was how projects like Mahout need a workflow system for gluing together
analysis steps (Jake's command-line props stuff notwithstanding), but
existing workflow systems have trouble helping us out.  Cascading is GPL,
Oozie is perpetually not quite ready, and Pig is too egocentric to be easy
to integrate.

The discussion veered into Chris's thoughts about how annotations could be
useful for making different kinds of programs amenable to integration into
workflow systems.  Out of this came the idea that there could be a non-GPL
annotation that could be consumed by a GPL workflow system without any
license question.  It became clear that this annotation idea would also make
it easy to integrate tasks into different workflow systems.

At this point Chris did as Chris tends to do and took action.  He created
not only the annotation system but also a small topological-sort workflow
manager as a proof of concept and reference implementation.  This workflow
manager is feature-deficient relative to Cascading in that it doesn't
restructure map-reduce programs, nor does it provide operations like group
or join.  What it does have is an Apache license, and what it would allow is
for Mahout programs to gain a workflow capability without a license
nightmare.

The new workflow system is called Riffle (being the little cousin of
Cascading after all) and Chris has produced a Git repository for the code.

The basic idea is that a workflow consists of multiple process steps and
that each process step can be interrogated to get a list of its input and
output dependencies.  The methods that return these
dependencies are marked with @DependencyIncoming or @DependencyOutgoing.
The application is responsible for adding and managing these dependencies.
Examples of input and output dependencies might be the names of HDFS files
or directories.  Once these dependencies have been defined, the workflow is
invoked and it starts workflow tasks in an order that guarantees that all of
the inputs for a task are available before it is run.
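
To make the idea concrete, here is a rough sketch of the mechanism in plain
Java.  It is not Riffle's code: the two annotations are hypothetical
stand-ins declared inside the example (they only borrow the names mentioned
above), and the Step class, the step names and the paths are all made up for
illustration.  The ordering loop simply treats a path as available when no
step that has yet to run produces it.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WorkflowSketch {
    // Hypothetical stand-ins for the annotations named above; not Riffle's own classes.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface DependencyIncoming {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface DependencyOutgoing {}

    // A made-up process step whose dependencies are HDFS-style path names.
    static final class Step {
        private final String name;
        private final List<String> inputs;
        private final List<String> outputs;

        Step(String name, List<String> inputs, List<String> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }

        @DependencyIncoming
        public Collection<String> getInputs() { return inputs; }

        @DependencyOutgoing
        public Collection<String> getOutputs() { return outputs; }

        @Override
        public String toString() { return name; }
    }

    // Collects the values returned by every public method carrying the given annotation.
    static Set<Object> dependencies(Object step, Class<? extends Annotation> marker) {
        Set<Object> found = new HashSet<>();
        try {
            for (Method m : step.getClass().getMethods()) {
                if (m.isAnnotationPresent(marker)) {
                    found.addAll((Collection<?>) m.invoke(step));
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read dependencies of " + step, e);
        }
        return found;
    }

    // Returns the steps in an order where each step's inputs exist before it runs.
    static List<Object> order(List<?> steps) {
        Set<Object> pending = new HashSet<>();  // outputs that no finished step has produced yet
        for (Object step : steps) {
            pending.addAll(dependencies(step, DependencyOutgoing.class));
        }
        List<Object> remaining = new ArrayList<>(steps);
        List<Object> ordered = new ArrayList<>();
        while (!remaining.isEmpty()) {
            Object ready = null;
            for (Object step : remaining) {
                // A step is ready when none of its inputs is still waiting to be produced.
                if (Collections.disjoint(dependencies(step, DependencyIncoming.class), pending)) {
                    ready = step;
                    break;
                }
            }
            if (ready == null) {
                throw new IllegalStateException("cyclic dependencies among " + remaining);
            }
            remaining.remove(ready);
            ordered.add(ready);
            pending.removeAll(dependencies(ready, DependencyOutgoing.class));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
            new Step("cluster", List.of("/data/vectors"), List.of("/data/clusters")),
            new Step("vectorize", List.of("/data/raw"), List.of("/data/vectors")));
        System.out.println(order(steps));  // prints [vectorize, cluster]
    }
}
```

The point of the sketch is that the ordering code only ever looks at the
annotations, never at the Step class itself, so any program that marks its
dependency methods this way could be scheduled without linking against the
workflow system.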

What do people think about this?  Is it as useful as I think it is?  Did I
not give enough information to even tell?

---------- Forwarded message ----------
From: Chris K Wensel <ch...@wensel.net>
Date: Mon, Mar 22, 2010 at 3:24 PM
Subject: riffle
To: Ted Dunning <ted.dunn...@gmail.com>


OK, lets stick it out there and see what happens.

http://github.com/cwensel/riffle

chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com
