Clustering Jobs and Drivers

Jeff Eastman Wed, 12 May 2010 09:48:40 -0700

With the recent removal of all the clustering Job classes we have introduced a fatal bug in all of the synthetic control examples. In the original implementation, the Jobs were responsible for deleting their output directory prior to running the various Drivers, which did not delete output. In removing the clustering Jobs, this deletion responsibility was moved to the Drivers. Problem is, the synthetic control examples transform the input file to the output/data directory before calling the clustering Driver, which now also zaps output (and thus its input in this case), causing file not found errors. I see 4 possible solutions:


  1. Reinstate the Job files, giving them the responsibility to delete
     their output directory and removing that responsibility from all
     Drivers. This will involve some code duplication in the Job and
     Driver main methods which can be addressed by refactoring.
  2. Leave the Drivers as-is and just remove their output deletion.
     This puts a bit more burden on the user but makes constructing job
     chains with clustering computations possible.
  3. Modify synthetic control examples to use a different, non-output,
     directory for the converted data and leave the Drivers alone.
  4. Finally, since chains of clustering jobs usually call the driver's
     static methods and not main, just move output deletion to main.
     The rub here is that sequences of command-line invocations would
     be problematic.
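
To make option 4 concrete, here is a rough sketch of the shape it might take. Everything in it is hypothetical: DriverSketch, runJob and deleteRecursively are illustrative names, not Mahout's actual API, and java.nio.file stands in for the Hadoop FileSystem so the sketch runs on its own. The point is only that the static method a job chain calls leaves output alone, while main clears it first.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical stand-in for a clustering Driver; not Mahout's actual API. */
final class DriverSketch {

  /** Recursively deletes dir if it exists (local stand-in for an HDFS delete). */
  static void deleteRecursively(Path dir) {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(dir)) {
      // Reverse order so children are deleted before their parents.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Static entry point used when chaining jobs: never deletes output. */
  static void runJob(Path input, Path output) {
    if (!Files.exists(input)) {
      throw new IllegalStateException("input not found: " + input);
    }
    try {
      // Placeholder for the real clustering work: just create a result directory.
      Files.createDirectories(output.resolve("clusters-0"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Command-line entry point: the only place output is cleared (option 4). */
  public static void main(String[] args) {
    Path input = Paths.get(args[0]);
    Path output = Paths.get(args[1]);
    deleteRecursively(output);
    runJob(input, output);
  }
}
```

With this split, a caller that has already written its converted data to output/data can invoke runJob directly and the data survives. A command-line run whose input lives under the previous step's output would still lose it, which is the rub noted above.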

I'd like to move towards a consistent pattern across Mahout (the reason for removing Jobs in the first place). I'm leaning towards #1 but would like some feedback, esp. from Sean and Robin who (iirc) started this ball rolling.
