[
https://issues.apache.org/jira/browse/MAHOUT-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000473#comment-13000473
]
Dmitriy Lyubimov edited comment on MAHOUT-593 at 2/28/11 7:18 PM:
------------------------------------------------------------------
{quote}DeleteOnClose is a cute way to stick in file deletion into a close
method; if it really works nicely for you OK. There's File.deleteOnExit() if
it's a question of temp file cleanup.{quote}
As you might guess, it is Inadco's practice we use :) The reason is the
difference between those two approaches: I want the file deleted asap -- i.e.
when the mapper or reducer is decommissioned, not when the JVM exits. We
actually use settings that allow Mahout to reuse a JVM for up to 10 tasks (to
cope better with massive mapper counts; spawning a JVM for every 64M worth of
input is way too expensive).
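A minimal sketch of the idea (the class name here is illustrative, not the
actual Inadco code); the JVM reuse mentioned above is the standard Hadoop 0.20
setting mapred.job.reuse.jvm.num.tasks:
{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;

// Illustrative sketch only: a stream that removes its backing temp file as
// soon as close() is called, so cleanup happens when the task is done with
// the file rather than at JVM exit. With JVM reuse across tasks,
// File.deleteOnExit() would keep accumulating temp files for the whole
// lifetime of the reused JVM.
public class DeleteOnCloseInputStream extends FilterInputStream {

  private final File file;

  public DeleteOnCloseInputStream(File file) throws IOException {
    super(new FileInputStream(file));
    this.file = file;
  }

  @Override
  public void close() throws IOException {
    try {
      super.close();
    } finally {
      file.delete(); // best effort; the file is a throwaway temp artifact
    }
  }
}
{code}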
{quote}Yes good move on standardizing commons-math dependency. That makes
sense.{quote}
In fact we discussed it somewhere already. This is partly due to the fact that
Mahout doesn't follow the (somewhat convoluted, imo) Maven doctrine where you
declare everything you use in the parent under <dependencyManagement> _with_
versions, and the actual modules carry _unversioned_ dependencies. That
achieves two main things:
* all modules end up using the same version at runtime as they were compiled
with;
* projects that wish to embed this code base can import those dependencies
into their own Maven build using <scope>import</scope> (one can't import
transitive dependencies, only those under <dependencyManagement>, which is a
problem with their approach imo).

So... Ted was suggesting we create a separate issue devoted to bringing it all
into agreement with that Maven ideology (clean the versions out of module
dependencies and move all of them under <dependencyManagement> in the parent
pom, which is partly followed now, but not everywhere, as the commons-math
case demonstrates); see the sketch below. I was planning to look at it more
closely and suggest a patch at a later time.
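A minimal sketch of that layout (the artifact coordinates and the
0.5-SNAPSHOT version here are illustrative, using commons-math 2.1 as the
example dependency):
{code:xml}
<!-- Parent pom: pin every version exactly once. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math</artifactId>
      <version>2.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Module pom: no <version>; it is inherited from the parent. -->
<dependencies>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math</artifactId>
  </dependency>
</dependencies>

<!-- Embedding project: import the managed versions wholesale. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout</artifactId>
      <version>0.5-SNAPSHOT</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
{code}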
{quote}My only structural complaint is that this is not really using
AbstractJob. I see the idea is to make several phases of the job runnable
independently. Is that a realistic use case? Usually you run all the phases to
get meaningful work done (perhaps with the option of restarting from phase n,
to recover from failure – but that's handled already in AbstractJob).{quote}
I wouldn't describe it as "giving up".
It's just that we have MR pipelines that are tightly coupled (i.e. they expect
predefined intermediate results and have common postprocessing to
move/rearrange files, which is certainly the case here). This is not exclusive
to Mahout; such pipelines pop up everywhere MR is used. So, for tightly
coupled MR steps, IMO it doesn't make sense to generate an args[] array and
then reparse it with a CLI parser. It's too much hassle for something that is
never intended to be run as a standalone job.
On the other hand, for every job that we feel is uncoupled enough that we are
compelled to give it its own CLI, sure, let's use AbstractJob and ToolRunner.
I can only vote for it with both hands. In fact, I am all for having it as a
standard: got a CLI? -- use AbstractJob, that's the rule. No CLI and never
will be? -- a single driver works better then.
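To make the coupled-pipeline point concrete, a minimal sketch using the 'old'
0.20 API (class and parameter names are hypothetical, not the actual SSVD
driver):
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Illustrative sketch only: a single driver chaining two tightly coupled
// phases. Parameters flow as typed values and predefined intermediate
// paths; there is no args[] generation and CLI reparsing between phases.
public class PipelineDriver {

  public static void run(Path input, Path tmp, Path output, int k)
      throws Exception {
    Path phase1Out = new Path(tmp, "phase1");

    JobConf phase1 = new JobConf(PipelineDriver.class);
    phase1.setJobName("pipeline-phase-1");
    phase1.setInt("pipeline.k", k);  // typed parameter, no CLI round trip
    // mapper/reducer/output format setup omitted for brevity
    FileInputFormat.setInputPaths(phase1, input);
    FileOutputFormat.setOutputPath(phase1, phase1Out);
    JobClient.runJob(phase1);        // blocks until the phase completes

    JobConf phase2 = new JobConf(PipelineDriver.class);
    phase2.setJobName("pipeline-phase-2");
    // phase 2 consumes phase 1's predefined intermediate result directly
    FileInputFormat.setInputPaths(phase2, phase1Out);
    FileOutputFormat.setOutputPath(phase2, output);
    JobClient.runJob(phase2);
  }
}
{code}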
> Backport of Stochastic SVD patch (Mahout-376) to hadoop 0.20 to ensure
> compatibility with current Mahout dependencies.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-593
> URL: https://issues.apache.org/jira/browse/MAHOUT-593
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.4
> Reporter: Dmitriy Lyubimov
> Fix For: 0.5
>
> Attachments: MAHOUT-593.patch.gz, MAHOUT-593.patch.gz,
> MAHOUT-593.patch.gz, SSVD-givens-CLI.pdf
>
>
> The current Mahout-376 patch requires the 'new' Hadoop API. Certain elements
> of that API (namely, multiple outputs) are not available in the standard
> Hadoop 0.20.2 release. As such, it may work only with either CDH or 0.21
> distributions. In order to bring it into sync with current Mahout
> dependencies, a backport of the patch to the 'old' API is needed.
> Also, some work is needed to resolve math dependencies. The existing patch
> relies on Apache commons-math 2.1 for eigen decomposition of small matrices.
> This dependency is not currently set up in Mahout core, so certain snippets
> of code are required to either go into mahout-math or use the Colt eigen
> decomposition (last time I tried, my results with that one were mixed: it
> seems to produce results inconsistent with those from the mahout-math
> eigensolver, and at the very least it doesn't produce singular values in
> sorted order).
> So this patch is mainly moving some Mahout-376 code around.
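For reference, a minimal sketch of the commons-math 2.1 eigen decomposition
discussed in the description (the matrix values are made up; the matrix must
be symmetric, which is the small-matrix case in question):
{code:java}
import java.util.Arrays;

import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.EigenDecompositionImpl;
import org.apache.commons.math.linear.RealMatrix;

// Illustrative only: eigen decomposition of a small symmetric matrix with
// commons-math 2.1, the dependency discussed in the description. Per the
// description, Colt's decomposition did not return values in sorted order,
// which is what motivated this dependency.
public class SmallEigenDemo {

  public static void main(String[] args) {
    RealMatrix m = new Array2DRowRealMatrix(new double[][] {
        {4.0, 1.0, 0.0},
        {1.0, 3.0, 1.0},
        {0.0, 1.0, 2.0}
    });
    EigenDecompositionImpl eig = new EigenDecompositionImpl(m, 0.0);
    System.out.println(Arrays.toString(eig.getRealEigenvalues()));
    System.out.println(eig.getV()); // eigenvectors as columns
  }
}
{code}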