[ https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966333#action_12966333 ]

Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------

{quote}
Sounds to me like the reducer could replicate the combiner and thus implement 
the second step of your hierarchy which would avoid the second MR pass. You 
could have a single reducer which receives all combiner outputs and thus merge 
everything.
{quote}

First of all, the second-level hierarchical merge is done in the Bt job 
mappers, so moving this code into a reducer wouldn't actually win anything in 
terms of actual IO or the number of MR passes.

Secondly, I don't believe a single reducer pass is efficient. All 
parallelization is gone, isn't it? But the problem is not even that, but the 
limitation on how many Rs you can preload for the merges into the same 
process. The Rs are preloaded as side info (and we relinquish one R with every 
Q we process in the combiner, so initially it loads them all but then throws 
them away one by one). The catch is that each Q update has to complete the 
merges of the remainder of the R sequence. That's described in the 
computeQHatSequence algorithm in my notes.
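In toy form, the preload-and-relinquish scheme looks roughly like this (an 
illustrative sketch only, not the actual patch code; the multiplications are 
numeric stand-ins for the real QR merge and Q-hat update operations):

```python
def merge_tail(rs):
    # stand-in for the sequential merge of the remaining R blocks
    # (the computeQHatSequence algorithm in the working notes)
    acc = rs[0]
    for r in rs[1:]:
        acc = acc * r  # placeholder for one R-merge step
    return acc

def process_q_blocks(q_blocks, r_blocks):
    """All Rs are preloaded up front as side info; each Q update must see
    the merge of the *remaining* R tail, then relinquishes one R."""
    rs = list(r_blocks)  # initially holds every R block
    out = []
    for q in q_blocks:
        out.append(q * merge_tail(rs))  # placeholder for the Q-hat update
        rs.pop(0)  # throw the Rs away one by one as Qs are processed
    return out
```

This is why the per-process R count is the limiting factor: memory peaks at 
the start, when the whole R sequence has to be resident at once.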

Finally, there's no need to sort. When we do the R merge, as shown in the 
notes, we have to revisit the Q blocks (and Rs) in exactly the same order we 
just produced them in. The problem is that when we revisit a Q block, we have 
to have the tail of all subsequent R blocks handy. So even if we sent them to 
reducers (which was my initial plan), we'd have to sort them by the task id 
they came from and then by their order inside the task that produced them. And 
then we might need to duplicate the R traffic to every reducer process. A huge 
waste, again.
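To make the contrast concrete: on the reducer side, the production order would 
have to be reconstructed with a composite secondary-sort key, something like 
this (illustrative only; the key fields are made up for the example):

```python
# Each record carries (source task id, block ordinal within that task, block).
# Reconstructing the visiting order means sorting by both key fields --
# exactly the sort that a mapper-local second pass avoids entirely.
records = [
    (1, 0, "Q1-0"), (0, 1, "Q0-1"), (0, 0, "Q0-0"), (1, 1, "Q1-1"),
]
records.sort(key=lambda rec: (rec[0], rec[1]))  # task id, then in-task order
ordered = [blk for _, _, blk in records]
```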

{quote}
 Since you can't guarantee that the combiner does any work, this is best 
practice anyway. The specification is that the combiner will run zero or more 
times.
{quote}

That actually is an absolutely valid point, and I was very concerned about it 
in the beginning. I expected it, but it has not turned up in tests so far, so 
the issue has been slowly glimmering in my subconscious ever since.

If that actually is a problem, then yes, I might be forced to extend the Q 
pass to include a reducer, which IMO would be a humongous exercise in wasting 
IO bandwidth. There's no need for that; there's only a need for a local merge 
of the R sequences and a second pass over the Q block sequence you just 
generated, in the same order you generated it. Even a combiner is overkill for 
that, if I were implementing it in a regular parallel batching way, since 
there's no need to reorder the Q blocks.

I also considered that if I am forced to address this, I can still do it in 
the mapper only, without even a combiner, by pushing the Q blocks to local FS 
side files and doing a second pass over them; that would be much more 
efficient than pushing them to another networked process. Not to mention all 
the unneeded sorting involved in between, and all the arm-twisting I was 
struggling with in sending them in same-task chunks in the same order. There's 
still a rudimentary remainder of that attempt in the code (the key that 
carries the task id and order id for a Q block, which is never subsequently 
used anywhere). But saving a local side file in the mapper and doing a second 
pass over it is still way more attractive than the reducer approach if I need 
to fix this.
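The mapper-local two-pass idea is trivial in toy form (names and the 
serialization are made up for illustration; the real code would use Hadoop's 
local filesystem and writables rather than pickle):

```python
import os
import pickle
import tempfile

def two_pass(q_blocks, second_pass_fn):
    """First pass: spill Q blocks to a local side file in production order.
    Second pass: stream them back in exactly that order -- no shuffle,
    no sort, no network hop to another process."""
    fd, path = tempfile.mkstemp()  # local FS side file
    os.close(fd)
    with open(path, "wb") as f:
        for q in q_blocks:
            pickle.dump(q, f)  # first pass: generate and spill locally
    results = []
    with open(path, "rb") as f:
        while True:
            try:
                results.append(second_pass_fn(pickle.load(f)))
            except EOFError:
                break  # end of the spilled sequence
    os.remove(path)
    return results
```

The point is that the file preserves production order for free, which is the 
only ordering guarantee the second pass actually needs.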

{quote}
This also raises the question of whether your combiner can be applied multiple 
times. I suspect yes. You will know better than I.
{quote}

Yes, that's the optional hierarchy increment I was talking about, which we can 
additionally implement if a billion for m and unbounded n is not good enough.
 



> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for 
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR 
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, 
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working 
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, 
> ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz, ssvd-m2.patch.gz, 
> ssvd-m3.patch.gz, Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
