Ah, ok, now I see what you are talking about. This is a bit of laziness on my part that I forgot about. The ModelDistribution is produced from 3-4 argument values (modelFactory, modelPrototype, distanceMeasure, prototypeSize) from the command line. You could just pass those argument values (all strings) and rebuild the ModelDistribution from those when needed.
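
A minimal sketch of that idea, not the actual driver code. It assumes a plain Map standing in for the job's Configuration so the example runs without Hadoop on the classpath. The key names, the Spec holder class and the example argument values are all hypothetical. In a real job, the rebuilt values would be handed to whatever code constructs the ModelDistribution today.

```java
import java.util.HashMap;
import java.util.Map;

public class ModelDistributionArgs {

    // Hypothetical key names, invented for this sketch.
    static final String FACTORY_KEY = "sketch.modelFactory";
    static final String PROTOTYPE_KEY = "sketch.modelPrototype";
    static final String MEASURE_KEY = "sketch.distanceMeasure";
    static final String SIZE_KEY = "sketch.prototypeSize";

    /** Hypothetical holder for the recovered arguments; not the real ModelDistribution. */
    public static final class Spec {
        public final String modelFactory;
        public final String modelPrototype;
        public final String distanceMeasure;
        public final int prototypeSize;

        Spec(String modelFactory, String modelPrototype,
             String distanceMeasure, int prototypeSize) {
            this.modelFactory = modelFactory;
            this.modelPrototype = modelPrototype;
            this.distanceMeasure = distanceMeasure;
            this.prototypeSize = prototypeSize;
        }
    }

    /** Driver side: store the command-line strings unchanged, one key per argument. */
    public static void store(Map<String, String> conf, String modelFactory,
                             String modelPrototype, String distanceMeasure,
                             String prototypeSize) {
        conf.put(FACTORY_KEY, modelFactory);
        conf.put(PROTOTYPE_KEY, modelPrototype);
        conf.put(MEASURE_KEY, distanceMeasure);
        conf.put(SIZE_KEY, prototypeSize);
    }

    /** Worker side: read the strings back and parse the one numeric argument. */
    public static Spec load(Map<String, String> conf) {
        return new Spec(
            require(conf, FACTORY_KEY),
            require(conf, PROTOTYPE_KEY),
            require(conf, MEASURE_KEY),
            Integer.parseInt(require(conf, SIZE_KEY)));
    }

    // Fail loudly if the driver never stored a value under this key.
    private static String require(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalStateException("Missing configuration value: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Placeholder class names, not real Mahout classes.
        store(conf, "example.SomeModelFactory", "example.SomeVector",
              "example.SomeDistanceMeasure", "10");
        Spec spec = load(conf);
        System.out.println(spec.modelFactory + " / " + spec.prototypeSize);
    }
}
```

Because only short strings travel with the job, nothing needs a JSON round trip, and the payload stays small regardless of how large the models themselves become.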

On 1/17/11 10:23 AM, Sean Owen wrote:
The idea was to remove all use of JSON in an attempt to reduce the number of
different serialization approaches used. So at the moment I'm trying to
figure out what happens when I delete everything related to JSON. Most of it
goes quietly.

The only use that seems, well, actively used is the bit in DirichletDriver
where...

     conf.set(MODEL_DISTRIBUTION_KEY, modelDistribution.asJsonString());

... the ModelDistribution is serialized to a String and stuffed in the job's
Configuration object. The DirichletMapper.getDirichletState() method then
deserializes. In this way the model distribution is passed to workers via
Configuration.

As Ted says, it seems like a minor abuse of "Configuration" but entirely
practical. Nothing's really wrong there other than the idea that perhaps
it'd be more uniform to pass this on the file system. Maybe at some point it
gets too big anyway to handle this way.

That's the only outstanding question for MAHOUT-510 at the moment as far as
I am concerned.


On Mon, Jan 17, 2011 at 5:17 PM, Jeff Eastman <[email protected]> wrote:

Dirichlet uses Writable to serialize its iteration output state (to
clusters-n). I'm confused about what you're trying to do.



On 1/17/11 9:58 AM, Ted Dunning wrote:

This sort of thing is what the distributed cache was designed for.

On Mon, Jan 17, 2011 at 8:53 AM, Sean Owen <[email protected]> wrote:

Do you think the way forward is to leave it, or use Writable and write the
model distribution to a file, or something else?


