Sorry, cross posting to save time.

I now have a WIP of Cascading 1.2 that includes support for Riffle annotations.

Riffle is an Apache-licensed library of Java annotations for marking the 
lifecycle and dependency methods on a 'process' object.

That is, you can create custom objects with 'start' and 'stop' methods, as well 
as with getters for incoming/outgoing resources (input files, and output files).

With a collection of such objects, each one handling a particular task such as 
running a copy job or a Mahout process, you can have either Riffle or Cascading 
chain and execute all the processes in dependency order.
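
To make the idea concrete, here is a rough, self-contained sketch of what 
"annotated process objects run in dependency order" can look like. The 
annotations (@Start, @Incoming, @Outgoing), the CopyJob class, and the ordering 
code are hypothetical stand-ins written for this example. They are not the 
actual Riffle annotations or the Cascading implementation, so check the Riffle 
source for the real names.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProcessChainSketch {
  // Hypothetical marker annotations, NOT the real Riffle ones.
  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
  public @interface Start {}
  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
  public @interface Incoming {}
  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
  public @interface Outgoing {}

  // A hypothetical 'process' object: a plain class with annotated methods.
  public static class CopyJob {
    private final String name, in, out;
    private final List<String> log;

    public CopyJob(String name, String in, String out, List<String> log) {
      this.name = name; this.in = in; this.out = out; this.log = log;
    }

    @Incoming public String getIn() { return in; }
    @Outgoing public String getOut() { return out; }
    // A real job would do the copy here; the sketch only records that it ran.
    @Start public void start() { log.add(name); }
  }

  // Invoke the first public method carrying the given marker annotation.
  static Object call(Object target, Class<? extends Annotation> marker) throws Exception {
    for (Method m : target.getClass().getMethods())
      if (m.isAnnotationPresent(marker)) return m.invoke(target);
    throw new IllegalStateException("no method marked @" + marker.getSimpleName());
  }

  // Order processes so each one runs after the process producing its input.
  static List<Object> order(List<Object> processes) throws Exception {
    List<Object> pending = new ArrayList<>(processes);
    List<Object> ordered = new ArrayList<>();
    while (!pending.isEmpty()) {
      // Resources that have not been produced yet.
      Set<Object> unproduced = new HashSet<>();
      for (Object p : pending) unproduced.add(call(p, Outgoing.class));
      Object ready = null;
      for (Object p : pending)
        if (!unproduced.contains(call(p, Incoming.class))) { ready = p; break; }
      if (ready == null) throw new IllegalStateException("dependency cycle");
      pending.remove(ready);
      ordered.add(ready);
    }
    return ordered;
  }

  // Run every process, in dependency order, via its @Start method.
  public static void runAll(List<Object> processes) throws Exception {
    for (Object p : order(processes)) call(p, Start.class);
  }

  public static void main(String[] args) throws Exception {
    List<String> log = new ArrayList<>();
    List<Object> jobs = Arrays.asList(
        new CopyJob("report", "clean.txt", "report.txt", log),
        new CopyJob("clean", "raw.txt", "clean.txt", log),
        new CopyJob("fetch", "remote.txt", "raw.txt", log));
    runAll(jobs);
    System.out.println(log); // prints [fetch, clean, report]
  }
}
```

The point of the design is that CopyJob implements no interface and extends 
nothing. The runner discovers everything it needs through the annotations, which 
is what lets unrelated projects be chained by a third-party tool.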

You can see more about Riffle here (which includes a tool to run a collection 
of processes):
http://github.com/cwensel/riffle

You can download WIP builds for Cascading 1.2 (1.1 is the current stable 
version) here:
http://www.concurrentinc.com/downloads/

Note that Riffle is very early stage (and likely naive), and the Cascading 
support is likely to evolve before the 1.2 final release (sometime this fall).

The long term goal here is to allow Mahout and other projects to apply the 
annotations, and then third party tools can be used to run the processes.

For you Cascading users, writing a simple DistCp wrapper (or putting the 
annotations directly on the Hadoop DistCp object) would allow an efficient copy 
to run inside of a Cascade process alongside your Flow instances.

Or, more importantly, you can write iterative processes (e.g. PageRank) that 
act like a single process even though internally an unknown number of Flows is 
being created on the fly. (I'm running a connected component algorithm in 
production now as a Riffle object, and it requires multiple Flows/passes.)
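
As a rough illustration of the "single process, unknown number of passes" 
shape, here is a minimal in-memory sketch. It assumes a simple label 
propagation approach to connected components, where each call to runPass() 
stands in for one Flow. The class and method names are hypothetical, nothing 
here uses the Cascading or Riffle APIs, and it is not the production algorithm 
mentioned above.

```java
import java.util.Arrays;

public class IterativeProcessSketch {
  private final int[][] edges; // each edge is a pair of node ids
  private final int[] labels;  // labels[n] = component label of node n
  private int passes;

  public IterativeProcessSketch(int nodeCount, int[][] edges) {
    this.edges = edges;
    this.labels = new int[nodeCount];
    for (int i = 0; i < nodeCount; i++) labels[i] = i; // every node starts alone
  }

  // One pass (a stand-in for one Flow): push the smaller label across each edge.
  private boolean runPass() {
    boolean changed = false;
    for (int[] e : edges) {
      int min = Math.min(labels[e[0]], labels[e[1]]);
      if (labels[e[0]] != min) { labels[e[0]] = min; changed = true; }
      if (labels[e[1]] != min) { labels[e[1]] = min; changed = true; }
    }
    return changed;
  }

  // The single entry point a caller sees; the number of passes is not known
  // up front and depends on the data.
  public void start() {
    do { passes++; } while (runPass());
  }

  public int[] getLabels() { return labels.clone(); }
  public int getPasses() { return passes; }

  public static void main(String[] args) {
    int[][] edges = { {3, 4}, {1, 2}, {0, 1} };
    IterativeProcessSketch process = new IterativeProcessSketch(5, edges);
    process.start();
    System.out.println(Arrays.toString(process.getLabels()));
    System.out.println("passes: " + process.getPasses());
  }
}
```

From the outside there is one start() call and one result. The loop that 
decides whether another pass is needed stays hidden inside the object, which is 
what lets a scheduler treat the whole thing as a single step.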

Please feel free to fork and tweak.

ckw

--
Chris K Wensel
[email protected]
http://www.concurrentinc.com
