[ https://issues.apache.org/jira/browse/MAHOUT-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000473#comment-13000473 ]

Dmitriy Lyubimov edited comment on MAHOUT-593 at 2/28/11 7:23 PM:
------------------------------------------------------------------

{quote}DeleteOnClose is a cute way to stick in file deletion into a close 
method; if it really works nicely for you OK. There's File.deleteOnExit() if 
it's a question of temp file cleanup.{quote}
As you might guess, it is Inadco's practice we use :) The reason is the 
difference between those approaches: I want the file deleted ASAP, i.e. when 
the mapper or reducer is decommissioned, not when the JVM exits. We actually 
use settings that allow JVM reuse for up to 10 tasks (to cope better with 
massive mapper counts; spawning a JVM for every 64M worth of input is way too 
expensive). Some jobs may create far too many files for us to afford delaying 
the cleanup.
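To illustrate the delete-on-close idea (a hypothetical sketch, not the actual Inadco class): wrap the output stream so the backing file is unlinked the moment close() runs, rather than lingering until JVM exit as with File.deleteOnExit():

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical sketch: an OutputStream over a temp file that deletes the
 * file as soon as close() is called -- important when the task JVM is
 * reused for several tasks and may stay alive long after this task ends.
 */
class DeleteOnCloseStream extends OutputStream {
    private final OutputStream delegate;
    private final File file;

    DeleteOnCloseStream(File file) throws IOException {
        this.file = file;
        this.delegate = new FileOutputStream(file);
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
    }

    @Override
    public void close() throws IOException {
        delegate.close();
        // Reclaim disk space immediately instead of waiting for JVM exit.
        file.delete();
    }
}
```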

{quote}Yes good move on standardizing commons-math dependency. That makes 
sense.{quote}
In fact, we discussed this somewhere already. This is partly due to the fact 
that Mahout doesn't follow the (somewhat convoluted, imo) Maven doctrine where 
you declare everything you use in a parent under <dependencyManagement> _with_ 
versions, and the actual modules carry no _versioned_ dependencies. That 
achieves two main things: 
* all modules end up using the same version at runtime as they were compiled 
with;
* projects that wish to embed this code base can import those dependencies 
into their own Maven build using the <scope>import</scope> spec (one can't 
import transitive dependencies, only those under <dependencyManagement>, which 
is a problem with their approach imo).

So... Ted was suggesting we create a separate issue devoted to bringing it all 
into agreement with that Maven ideology (clean versions out of the module 
dependencies and move all of them under <dependencyManagement> in the parent 
pom, which is partly followed now, but not everywhere, as the commons-math 
case demonstrates). I was planning to look at it more closely and suggest a 
patch at a later time.
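A minimal sketch of that convention (coordinates are illustrative only): the parent pom pins versions once under <dependencyManagement>, modules declare the dependency unversioned, and an embedding project can pull the whole managed set in with <scope>import</scope>:

```xml
<!-- parent pom: versions pinned once under dependencyManagement -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math</artifactId>
      <version>2.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- module pom: no version; it is inherited from the parent -->
<dependencies>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math</artifactId>
  </dependency>
</dependencies>

<!-- embedding project: import the parent's whole managed set -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout</artifactId>
      <version>0.5</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```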

{quote}My only structural complaint is that this is not really using 
AbstractJob. I see the idea is to make several phases of the job runnable 
independently. Is that a realistic use case? Usually you run all the phases to 
get meaningful work done (perhaps with the option of restarting from phase n, 
to recover from failure – but that's handled already in AbstractJob).{quote}
I wouldn't describe it as "giving up". 
It's just that we have MR pipelines that are closely coupled (i.e. they expect 
predefined intermediate results and have common postprocessing to 
move/rearrange files, which is certainly the case here). This is not exclusive 
to Mahout; such pipelines pop up everywhere MR is used. So, for tightly 
coupled MR steps, IMO it doesn't make sense to generate a String[] of args and 
then reparse them with a CLI parser. It's too much hassle for something that 
is never intended to be run as a standalone job. 
On the other hand, for every job that we feel is uncoupled enough that we are 
compelled to create its own CLI, sure, let's use AbstractJob and ToolRunner. I 
can only vote for it with both hands. In fact, I am all for having it as a 
standard. Got a CLI? -- use AbstractJob, that's the rule here. No CLI and 
never will be? -- a single driver works better then.
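To make the contrast concrete, here is a hypothetical, stripped-down sketch (plain Java, no Hadoop, invented phase names) of a single driver invoking tightly coupled phases directly with typed parameters, instead of serializing everything into a String[] and re-parsing it with a CLI parser in each phase:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: each "phase" stands in for one MR job of a tightly
 * coupled pipeline. The driver passes typed parameters directly; the layout
 * of intermediate output is an internal contract between phases, so there is
 * nothing for a CLI parser to do.
 */
class PipelineDriver {
    final List<String> log = new ArrayList<>();

    void runQPhase(String inputPath, String qPath, int k) {
        log.add("Q: " + inputPath + " -> " + qPath + " (k=" + k + ")");
    }

    void runBtPhase(String qPath, String btPath) {
        log.add("Bt: " + qPath + " -> " + btPath);
    }

    void run(String input, String tmpDir, int k) {
        String qPath = tmpDir + "/Q";   // internal contract, not a user option
        String btPath = tmpDir + "/Bt";
        runQPhase(input, qPath, k);
        runBtPhase(qPath, btPath);      // consumes the previous phase's output
    }
}
```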

> Backport of Stochastic SVD patch (Mahout-376) to hadoop 0.20 to ensure 
> compatibility with current Mahout dependencies.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-593
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-593
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.4
>            Reporter: Dmitriy Lyubimov
>             Fix For: 0.5
>
>         Attachments: MAHOUT-593.patch.gz, MAHOUT-593.patch.gz, 
> MAHOUT-593.patch.gz, SSVD-givens-CLI.pdf
>
>
> The current MAHOUT-376 patch requires the 'new' Hadoop API.  Certain elements 
> of that API (namely, multiple outputs) are not available in the standard 
> Hadoop 0.20.2 release. As such, the patch may work only with either CDH or 
> 0.21 distributions. 
>  In order to bring it into sync with current Mahout dependencies, a backport 
> of the patch to the 'old' API is needed. 
> Also, some work is needed to resolve math dependencies. The existing patch 
> relies on Apache commons-math 2.1 for eigendecomposition of small matrices. 
> This dependency is not currently set up in Mahout core. So, certain snippets 
> of code either need to move to mahout-math or use the Colt eigen 
> decomposition (last time I tried, my results with that one were mixed: it 
> seems to produce results inconsistent with those from the mahout-math 
> eigensolver; at the very least, it doesn't produce singular values in sorted 
> order).
> So this patch is mainly moving some MAHOUT-376 code around.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

