[
https://issues.apache.org/jira/browse/MAHOUT-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994464#comment-12994464
]
Dmitriy Lyubimov commented on MAHOUT-593:
-----------------------------------------
I'll attend to the code style changes.
{quote}You may need to regenerate this patch as I think a few recent changes
will conflict{quote}
I'll do that, but I think it might be more convenient to address it at commit
time with a git rebase or something else that has automatic 3-way merge
capability (that's how I am accustomed to managing this). Otherwise, if
committing is delayed, chances are those little changes in the pom will
conflict again, since it is touched pretty frequently by the other
committers. I know Mahout uses svn though, so that might be more of a problem.
{quote}It's also preferable, I think, to try to standardize how the Job is
implemented. I believe we're trying to use this AbstractJob class as the
template for Job implementations. Can this be put into that mold? I see at
least one implementation does this.{quote}
Yes, the main driver does accept the configuration, but since this is a
multi-job MapReduce pipeline, it just passes it on as a Configuration object.
The reasons for that are:
1) There's only one command line, not an individual one per job. Therefore,
there's only one Tool and only one ToolRunner.run() call.
2) I need to control what is passed on to each job. While general MapReduce
parameters are mostly allowed to pass through, in certain cases I need to
control the number of reducers and other Hadoop specifics such as increasing
the split size. This is much more conveniently done on a Configuration object
than on an unparsed args[]. I already thought of this, and we already
discussed it somewhere on the list. It seems we like AbstractJob, but in this
case I think it would be more of a headache to interface through run(args[])
than through run(Configuration).
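To make the point concrete, here is a minimal sketch (not the actual patch
code) of that single-driver pattern: one Tool, one ToolRunner.run() call, and
each MR stage receiving a Configuration rather than re-parsing args[]. The
class and method names (SSVDDriver, runFirstStage, runSecondStage) and the
specific keys tweaked are illustrative assumptions only.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: one command line for the whole multi-job pipeline.
public class SSVDDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Generic -D options have already been absorbed into getConf() by
    // ToolRunner, so they flow to every stage automatically.
    Configuration conf = getConf();
    runFirstStage(conf);
    runSecondStage(conf);
    return 0;
  }

  private void runFirstStage(Configuration conf) throws IOException {
    JobConf job = new JobConf(conf);
    // Stage-specific overrides are trivial on a Configuration/JobConf:
    job.setNumReduceTasks(1);                         // this stage wants a single reducer
    job.setLong("mapred.min.split.size", 64L << 20);  // enlarge splits for this stage only
    // ... mapper, reducer, input/output paths elided ...
    JobClient.runJob(job);
  }

  private void runSecondStage(Configuration conf) throws IOException {
    JobConf job = new JobConf(conf);
    // ... a different reducer count, different paths, etc. ...
    JobClient.runJob(job);
  }

  public static void main(String[] args) throws Exception {
    // Exactly one ToolRunner.run() call for the whole pipeline.
    System.exit(ToolRunner.run(new Configuration(), new SSVDDriver(), args));
  }
}
{code}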
{quote}I think it's a decent convention to not have Mappers and Reducers as
inner classes.{quote}
Sorry, not to start a discussion on this or anything, but just for my
information, is this a project-wide consensus or is it your opinion? I don't
think I am with you on this, nor do I see consensus in the project. I did not
scan the whole Mahout tree for this, but just scanning one package up, all
the DRM jobs seem to happily use inner classes. At least it seems there
wasn't such a consensus until pretty recently. The Hadoop literature that I
scanned seems to use inner classes too.
If you really want to do this, do you also want partitioners and grouping
comparators to be separate classes? Are you against the inner class paradigm
in general? Why just mappers?
If we do that, it would multiply the number of classes at least 3-fold while
breaking the "job-as-minicomponent" paradigm. Inner classes are in a way a
natural concern-separation tool, similar to what namespaces are in C++; since
Java doesn't really have a grammar-level concept of a namespace, static inner
classes are kind of its way to build 'minicomponents'. The only downside of
inner classes is when there's too much code to keep in one file, so there's
nothing 'mini' about the component anymore, in which case one might really
consider packages to demarcate component boundaries. That is not the case
here (we are still in the 'mini' realm, I think).
In short, as Stroustrup put it, object-oriented code doesn't make code
shorter, nor more efficient; if anything, it makes it less efficient. Pretty
much the only reason to have it is to keep the same problem domain addressed
in one place, because that presumably makes code more maintainable. If you
really want to dump 5-odd mappers, about the same number of reducers and
drivers, plus all the helpers into a flat namespace, at least let's do it in
packages then. But then using packages would create a tree deeper than it
needs to be (IMO).
{quote}In general I think lots of inner classes that don't need to be
inner just makes things hard to find{quote}
Eclipse type lookup works regardless of whether a class is inner or not. I
don't use IDEA a lot, but I am pretty sure it works the same way there.
If the consensus is that we'd rather have a package per job than inner
classes (which would be my preference in this case), I'll refactor each job
into a package with a bunch of classes sprouting underneath.
Makes sense?
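For reference, here is a minimal sketch of the "job-as-minicomponent" layout
being discussed: the job class with its static inner Mapper and Reducer in
one file. This is a generic word count on the old (0.20) API, purely for
illustration; none of these names come from the patch.
{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical job "minicomponent": the whole MR stage reads as one unit.
public final class WordCountJob {

  // The mapper is an implementation detail of this job, so it stays inside it.
  public static class WordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        out.collect(new Text(token), ONE);
      }
    }
  }

  // As is the reducer.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  private WordCountJob() { }
}
{code}
A driver then wires these up with job.setMapperClass(WordCountJob.WordMapper.class)
and job.setReducerClass(WordCountJob.SumReducer.class), so everything about the
stage is found in one place.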
{quote}Can you comment on how this replaces, extends, complements the 2+
existing SVD implementation we have already? (In the javadoc ideally){quote}
The main issue for all this is MAHOUT-376; there's *a lot* of discussion
there -- probably more than there needs to be.
In short, _compared to Lanczos SVD, I think the opinion is that the
stochastic approach can take on larger problems and also compute them faster
with lower memory requirements than Lanczos would -- but Lanczos would of
course be more precise. Some problems with tens of billions of rows may not
require a precise solution (such as LSI), in which case the stochastic method
might be preferable_. I am not sure about Hebbian though. I can put that
summary in the javadoc if that's the consensus, but as a matter of personal
reputation, I purposely avoided stating or speculating on any comparisons, as
I have not had a chance to run conclusive benchmarks against Lanczos on
problems of that size. So I wouldn't put that in on my own accord at the
moment. (I know that processing speed here is comparable to, or a little
better than, seq2sparse on the same problem size, and it works reasonably
well for us, but I don't have larger-scale experimental data.) Like I
suggested in MAHOUT-376, let's do benchmarks to confirm.
The only thing I am relatively sure about is that this SVD code adheres to
the row key contract, as I commented on MAHOUT-322, and last time I checked,
Lanczos did not -- which is not a big deal, but it renders Lanczos
incompatible with the direct output of seq2sparse. That's about all I am more
or less sure about.
> Backport of Stochastic SVD patch (Mahout-376) to hadoop 0.20 to ensure
> compatibility with current Mahout dependencies.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-593
> URL: https://issues.apache.org/jira/browse/MAHOUT-593
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.4
> Reporter: Dmitriy Lyubimov
> Fix For: 0.5
>
> Attachments: MAHOUT-593.patch.gz
>
>
> Current Mahout-376 patch requires the 'new' Hadoop API. Certain elements of that
> API (namely, multiple outputs) are not available in the standard Hadoop 0.20.2
> release. As such, it may work only with either CDH or 0.21 distributions.
> In order to bring it into sync with current Mahout dependencies, a backport
> of the patch to the 'old' API is needed.
> Also, some work is needed to resolve math dependencies. The existing patch relies
> on Apache commons-math 2.1 for eigen decomposition of small matrices. This
> dependency is not currently set up in Mahout core. So, certain snippets
> of code are either required to go to mahout-math or to use the Colt eigen
> decomposition (last time I tried, my results were mixed with that one; it
> seems to produce results inconsistent with those from the mahout-math
> eigensolver -- at the very least, it doesn't produce singular values in sorted
> order).
> So this patch is mainly moving some Mahout-376 code around.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira