[
https://issues.apache.org/jira/browse/MAHOUT-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994464#comment-12994464
]
Dmitriy Lyubimov commented on MAHOUT-593:
-----------------------------------------
I'll attend to the code style changes.
{quote}You may need to regenerate this patch as I think a few recent changes
will conflict{quote}
I'll do that, but I think it might be more convenient to address it at commit
time with a git rebase or something else that has automatic 3-way merge
capability (that's how I am accustomed to managing this). Otherwise, if
committing is delayed, chances are those little changes in the pom will
conflict again, since it is touched pretty frequently by the other
committers. I know Mahout uses svn though, so that might be more of a problem.
{quote}It's also preferable, I think, to try to standardize how the Job is
implemented. I believe we're trying to use this AbstractJob class as the
template for Job implementations. Can this be put into that mold? I see at
least one implementation does this.{quote}
Yes, the main driver does accept the configuration, but since this is a
multi-job MapReduce pipeline, it just passes it on as a Configuration object.
The reasons for that are:
1) There's only one command line, not an individual one per job. Therefore,
there's only one Tool and only one ToolRunner.run() call.
2) I need to control what is passed on to each job. While general MapReduce
parameters are mostly allowed to pass through, in certain cases I need to
control the number of reducers and other Hadoop specifics such as increasing
the split size. This is much more conveniently done on a Configuration object
than on an unparsed args[]. I already thought of this, and we already
discussed it somewhere on the list. It seems we like AbstractJob, but in this
case I think it would be more of a headache to interface through run(args[])
than through run(Configuration).
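To make the point concrete, here is a minimal sketch (not the actual patch
code) of that single-driver pattern: one Tool, one ToolRunner.run() call, and
each MR stage receiving a Configuration rather than re-parsing args[]. The
class and method names (SSVDDriver, runFirstStage, runSecondStage) and the
specific keys tweaked are illustrative assumptions only.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: one command line for the whole multi-job pipeline.
public class SSVDDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Generic -D options have already been absorbed into getConf() by
    // ToolRunner, so they flow to every stage automatically.
    Configuration conf = getConf();
    runFirstStage(conf);
    runSecondStage(conf);
    return 0;
  }

  private void runFirstStage(Configuration conf) throws IOException {
    JobConf job = new JobConf(conf);
    // Stage-specific overrides are trivial on a Configuration/JobConf:
    job.setNumReduceTasks(1);                         // this stage wants a single reducer
    job.setLong("mapred.min.split.size", 64L << 20);  // enlarge splits for this stage only
    // ... mapper, reducer, input/output paths elided ...
    JobClient.runJob(job);
  }

  private void runSecondStage(Configuration conf) throws IOException {
    JobConf job = new JobConf(conf);
    // ... a different reducer count, different paths, etc. ...
    JobClient.runJob(job);
  }

  public static void main(String[] args) throws Exception {
    // Exactly one ToolRunner.run() call for the whole pipeline.
    System.exit(ToolRunner.run(new Configuration(), new SSVDDriver(), args));
  }
}
{code}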
{quote}I think it's a decent convention to not have Mappers and Reducers as
inner classes.{quote}
Sorry, not to start a discussion on this or anything, but just for my
information, is this a project-wide consensus or is it your opinion? I don't
think I am with you on this, nor do I see consensus in the project. I did not
scan the whole Mahout tree for this, but just scanning one package up, all
the DRM jobs seem to happily use inner classes. At least it seems there
wasn't such a consensus until pretty recently. The Hadoop literature that I
scanned seems to use inner classes too.
If you really want to do this, do you also want partitioners and grouping
comparators to be separate classes? Are you against the inner class paradigm
in general? Why just mappers?
If we do that, it would multiply the number of classes at least 3-fold while
breaking the "job-as-minicomponent" paradigm. Inner classes are in a way a
natural concern-separation tool, similar to what namespaces are in C++; since
Java doesn't really have a grammar-level concept of a namespace, static inner
classes are kind of its way to build 'minicomponents'. The only downside of
inner classes is when there's too much code to keep in one file, so there's
nothing 'mini' about the component anymore, in which case one might really
consider packages to demarcate component boundaries. That is not the case
here (we are still in the 'mini' realm, I think).
In short, as Stroustrup put it, object-oriented code doesn't make code
shorter, nor more efficient; if anything, it makes it less efficient. Pretty
much the only reason to have it is to keep the same problem domain addressed
in one place, because that presumably makes code more maintainable. If you
really want to dump 5-odd mappers, about the same number of reducers and
drivers, plus all the helpers into a flat namespace, at least let's do it in
packages then. But then using packages would create a tree deeper than it
needs to be (IMO).
{quote}In general I think lots of inner classes that don't need to be
inner just makes things hard to find{quote}
Eclipse type lookup works regardless of whether a class is inner or not. I
don't use IDEA a lot, but I am pretty sure it works the same way there.
If the consensus is that we'd rather have a package per job than inner
classes (which would be my preference in this case), I'll refactor each job
into a package with a bunch of classes sprouting underneath.
Makes sense?
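For reference, here is a minimal sketch of the "job-as-minicomponent" layout
being discussed: the job class with its static inner Mapper and Reducer in
one file. This is a generic word count on the old (0.20) API, purely for
illustration; none of these names come from the patch.
{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical job "minicomponent": the whole MR stage reads as one unit.
public final class WordCountJob {

  // The mapper is an implementation detail of this job, so it stays inside it.
  public static class WordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        out.collect(new Text(token), ONE);
      }
    }
  }

  // As is the reducer.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  private WordCountJob() { }
}
{code}
A driver then wires these up with job.setMapperClass(WordCountJob.WordMapper.class)
and job.setReducerClass(WordCountJob.SumReducer.class), so everything about the
stage is found in one place.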
{quote}Can you comment on how this replaces, extends, complements the 2+
existing SVD implementation we have already? (In the javadoc ideally){quote}
The main issue for all this is MAHOUT-376; there's *a lot* of discussion
there -- probably more than there needs to be.
In short, _compared to Lanczos SVD, I think the opinion is that the
stochastic approach can take on larger problems and also compute them faster
with lower memory requirements than Lanczos would -- but Lanczos would of
course be more precise. Some problems with tens of billions of rows may not
require a precise solution (such as LSI), in which case the stochastic method
might be preferable_. I am not sure about Hebbian though. I can put that
summary in the javadoc if that's the consensus, but as a matter of personal
reputation, I purposely avoided stating or speculating on any comparisons, as
I have not had a chance to run conclusive benchmarks against Lanczos on
problems of that size. So I wouldn't put that in on my own accord at the
moment. (I know that processing speed here is comparable to, or a little
better than, seq2sparse on the same problem size, and it works reasonably
well for us, but I don't have larger-scale experimental data.) Like I
suggested in MAHOUT-376, let's do benchmarks to confirm.
The only thing I am relatively sure about is that this SVD code adheres to
the row key contract, as I commented on MAHOUT-322, and last time I checked,
Lanczos did not -- which is not a big deal, but it renders Lanczos
incompatible with the direct output of seq2sparse. That's about all I am more
or less sure about.
> Backport of Stochastic SVD patch (Mahout-376) to hadoop 0.20 to ensure
> compatibility with current Mahout dependencies.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-593
> URL: https://issues.apache.org/jira/browse/MAHOUT-593
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.4
> Reporter: Dmitriy Lyubimov
> Fix For: 0.5
>
> Attachments: MAHOUT-593.patch.gz
>
>
> Current Mahout-376 patch requires the 'new' Hadoop API. Certain elements of that
> API (namely, multiple outputs) are not available in the standard Hadoop 0.20.2
> release. As such, it may work only with either CDH or 0.21 distributions.
> In order to bring it into sync with current Mahout dependencies, a backport
> of the patch to the 'old' API is needed.
> Also, some work is needed to resolve math dependencies. The existing patch relies
> on Apache commons-math 2.1 for eigen decomposition of small matrices. This
> dependency is not currently set up in Mahout core. So, certain snippets
> of code are either required to go to mahout-math or to use the Colt eigen
> decomposition (last time I tried, my results were mixed with that one; it
> seems to produce results inconsistent with those from the mahout-math
> eigensolver -- at the very least, it doesn't produce singular values in sorted
> order).
> So this patch is mainly moving some Mahout-376 code around.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira