Hi Shannon,
On Tue, May 25, 2010 at 8:10 PM, Shannon Quinn <[email protected]> wrote:
>
>
> 1) There was a thread early on regarding having the students keep blogs
> about their projects. Should we be using our own blogs for this purpose, or
> do this somewhere on the Mahout wiki, or other?
>
Blogging about your project is great, as is putting stuff on the wiki, but
make sure you post a link on this list to wherever you put it, because this
list is where all the real communication between Mahout community members
happens.
>
> 2) In getting a feel for Mahout, I've been running a few of the examples on
> my own, and have noticed that if I supply the "-h" argument by itself to
> some of the available programs, I get an exception, followed by the list of
> available options for that program. For instance, in running "./bin/mahout
> seq2sparse -h":
>
That is almost definitely a bug in either AbstractJob or the Driver class
which is using it. File a JIRA ticket for it, and earn brownie points from
all of us (find out where in the code it's barfing and earn even bigger
points, and if you post too many patches fixing said problems, you'll wind
up inadvertently becoming a committer!).
> 3) Saving the best for last :) I'm having some trouble getting off the
> ground with my code, and was hoping I could bounce some ideas around with
> the experts who are intimately familiar with the architecture (I'm working
> to implement MAHOUT-363).
>
Ok, on to the meaty stuff, and just in time: I'm back from vacation,
and I'm all lonely because my family is still on vacation themselves,
so I've got some time to look at this (and in fact, I'm the guy who
wrote our current distributed eigen-solving code).
> I see that there are drivers associated with each program, specifically to
> handle initializing the algorithm and process inputs from the command line.
> It looks like the functions main() and runJob() are always present, along
> with other high-level methods (though private) that invoke the other
> classes
> in the respective clustering folders. I also see all the clustering
> algorithms have mapper classes, responsible for implementing the map/reduce
> portion.
>
So at a high level, you want to make a Driver class which subclasses
AbstractJob, and which has an amazingly trivial main() method that
just calls ToolRunner.run(new MyConcreteJob(), args).
Then you implement the run(String[] args) method, which is where the
main action goes.
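A barebones sketch of what I mean (MyConcreteJob and the option
handling are placeholders, obviously):

  import org.apache.hadoop.util.ToolRunner;
  import org.apache.mahout.common.AbstractJob;

  public class MyConcreteJob extends AbstractJob {

    public static void main(String[] args) throws Exception {
      // ToolRunner handles the generic Hadoop args, then hands off to run()
      ToolRunner.run(new MyConcreteJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
      // parse your command-line options here, then kick off the
      // map-reduce jobs which do the real work
      return 0;
    }
  }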
> I suppose my question here is, is there a "standard" architecture for
> implementing a new algorithm aside from what I've just described (i.e. am I
> missing anything)? Are there any superclasses/interfaces I should pay
> particular attention to in terms of extending/implementing? Also, my
> algorithm depends in no small part on calculating eigenvalues and
> eigenvectors of affinity/markov transition matrices, and I understand that
> Mahout has a parallel eigensolver implemented. How can I make use of this?
>
For the specific case where you're going to be doing lots of vector
computations, and even more specifically, eigenvector computations,
you want to get familiar with the following class:
org.apache.mahout.math.hadoop.DistributedRowMatrix
This class has lots of methods built around manipulating a
SequenceFile<IntWritable,VectorWritable>, which is your HDFS
representation of a big distributed matrix.
To use this class, you need to make sure that you have a path to
a SequenceFile as above (the output of seq2sparse will give you
a SequenceFile<Text,VectorWritable>, so you may need to write
a simple script to turn that Text into unique integers), but then
the constructor for DistributedRowMatrix really just takes the path
to this SequenceFile and its sizes (row and column max values).
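Construction looks roughly like this (the paths and sizes here are
made up, and the exact constructor arguments may vary a little
depending on which revision you're on):

  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;

  // hypothetical HDFS locations and dimensions, just for illustration
  Path rowPath = new Path("/user/shannon/affinityMatrix");
  Path tmpPath = new Path("/user/shannon/tmp");
  int numRows = 10000;
  int numCols = 10000;
  DistributedRowMatrix matrix =
      new DistributedRowMatrix(rowPath, tmpPath, numRows, numCols);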
Once you have a DistributedRowMatrix instantiated in your run()
method, you can do lots of distributed things: multiply the entire
matrix by a vector with matrix.times(Vector v) (useful for checking
whether it's an eigenvector or not!); if you have two
DistributedRowMatrix instances with the same number of rows, you
can compute a^{transpose}*b by doing a.times(b) (it should really
be called a.transposeTimes(b), I know...); etc. These method calls
will fire off map-reduce jobs on your Hadoop cluster for you, and
will return when the data is at its destination, at which point you
have a handle on the output (it's the return value).
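In code, those two operations look something like this (someVector,
a, and b are hypothetical, built as above):

  import org.apache.mahout.math.Vector;

  // distributed matrix-times-vector: fires off a map-reduce job
  Vector product = matrix.times(someVector);

  // a and b have the same number of rows; this computes a^T * b
  DistributedRowMatrix aTransposeB = a.times(b);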
If you want eigenvectors, look at
o.a.m.math.hadoop.decomposer.DistributedLanczosSolver
and in particular its run() method. It instantiates a
DistributedRowMatrix as I describe above, then solves for
the specified number of eigenvectors/values.
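The core of it boils down to something like the following (treat the
exact solve() signature as a sketch, as it may shift between
versions):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.mahout.math.DenseMatrix;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

  int desiredRank = 10;  // hypothetical: how many eigenvector/value pairs you want
  Matrix eigenVectors = new DenseMatrix(desiredRank, numCols);
  List<Double> eigenValues = new ArrayList<Double>();
  // matrix is the DistributedRowMatrix from before
  new DistributedLanczosSolver().solve(matrix, desiredRank,
      eigenVectors, eigenValues);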
You may not even need to write very much "map/reduce"
code, as the MapReduce jobs which have already been
written can be immediately leveraged for your work.
> And as always, any other "general" pointers regarding the overall Mahout
> architecture? :)
>
Actually, I have some questions for you about the EigenCuts
algorithm. Well, a single, very general one: how does it really work?
You describe that you need to compute eigenvectors of the
transition matrix of a Markov chain representation of the data...
I'm not sure I understand exactly what you mean by this. What
kind of data do you start with, for one thing? And how many
eigenvectors do you try to get? What do you do to "perturb"
the weights between nodes? And what does the "cut" operation
look like?
Now that you've been looking at this algorithm in more detail,
can you sum it up in a simpler way, for busy people who are
too lazy to try and understand the NIPS paper it is written up
in? :)
-jake
>
> I apologize if any of these questions are pedantic...and I apologize in
> advance, as I'll likely be a regular fixture here with more questions as
> the
> summer continues :) Thank you very much for your help!
>
> Regards,
> Shannon
>