Here are my thoughts so far:

http://dl.dropbox.com/u/36863361/sd-2.pdf

and tex source:

http://dl.dropbox.com/u/36863361/sd-2.tex

I think that this gets rid of the QR steps.  I am still debugging the case
of a singular matrix, but that case shouldn't arise with any real data.
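
To give the flavor of how a QR step can be avoided in this kind of
decomposition (the pdf above is authoritative on what sd-2 actually does),
one can orthonormalize the projected matrix Y through its small Gram
matrix instead of factoring Y directly:

    Y = A \Omega, \qquad Y^\top Y = L L^\top \ (\text{Cholesky}), \qquad Q = Y L^{-\top}

Q then has orthonormal columns, since Q^T Q = L^{-1} (Y^T Y) L^{-T} = I,
and no QR of the tall matrix Y is ever needed.  The construction only
breaks down when Y^T Y is singular, i.e. when Y is rank-deficient.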

On Wed, Aug 17, 2011 at 12:01 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I will take a look, although there seems to be a lot of new stuff I don't
> have time to read the science for.
>
> On top of that, I have been planning some improvements to SSVD scaling,
> and removal of its current limitations, for some time now, such as:
>
> -- SSVD-wide enhancements: to allow better wide scaling, in summary up to
> billions of non-zero elements per row:
>    -- remove the "at least k+p rows per map task" limitation without
> causing "supersplits", by allowing blocked QR pushdown to reducers (or
> perhaps even automatic pushdown; I am not sure if that is possible).
>    -- I have already used SSVD code that equips the vector with a
> preprocessor via the Configured Hadoop interface, allowing on-the-fly
> random projection so that very long rows can be projected without ever
> loading them into memory (see the rough sketch after this list).
>
> -- "SSVD-tall" improvements: to allow more vertical scaling (currently
> thought to be at about billion rows with a lot of memory) by introducing
> more bottom-up divide-and-conquer QR steps in the middle.
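>
> To make the preprocessor idea concrete, here is a rough sketch
> (illustrative names only, not the actual Mahout code) of projecting one
> very long sparse row without materializing either the row or Omega; each
> row of Omega is regenerated on the fly from a seed and the column index:
>
>   import java.util.Random;
>
>   /** Sketch of on-the-fly random projection of one huge sparse row. */
>   public class StreamingProjection {
>     private final int kp;     // k + p, the projected width
>     private final long seed;  // global seed shared by all tasks
>
>     public StreamingProjection(int kp, long seed) {
>       this.kp = kp;
>       this.seed = seed;
>     }
>
>     /**
>      * Fold one nonzero element a[j] of the input row into the
>      * (k+p)-long accumulator y.  Called while streaming the row, so
>      * the row itself never has to fit in memory.
>      */
>     public void accumulate(double[] y, int j, double value) {
>       // Regenerate row j of Omega deterministically from (seed, j);
>       // a real implementation would want a stronger hash than this.
>       Random omegaRow = new Random(seed ^ (31L * j + 17L));
>       for (int c = 0; c < kp; c++) {
>         y[c] += value * omegaRow.nextGaussian();
>       }
>     }
>
>     public static void main(String[] args) {
>       StreamingProjection p = new StreamingProjection(3, 42L);
>       double[] y = new double[3];
>       p.accumulate(y, 5, 2.0);            // a[5] = 2.0
>       p.accumulate(y, 999_999_999, -0.5); // a[999999999] = -0.5
>       System.out.println(java.util.Arrays.toString(y));
>     }
>   }
>
> The caller just allocates a double[k+p] accumulator per row and calls
> accumulate() for every (index, value) pair as it streams by; the result
> is the projected row y = a * Omega.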
>
> Unfortunately, I see most of those improvements (except probably the
> preprocessor improvement, and perhaps the QR pushdown) as a purely
> theoretical challenge, as I have yet to find a use case for them, either
> myself or in public; hence it is merely a theoretical scale interest right
> now. A dense matrix of even a million by a million is already a 5 to 8 TB
> input file, which is a challenge for me to find, much less benchmark on a
> thousand-node cluster, and that case is thought to be well covered even by
> the current code. A potential challenge to it is high variance in the
> number of nonzero elements per input row (it may be a million on average
> with spikes to a billion or so, which would mean an 8 GB vector at 8 bytes
> per double).
>
> Given that I seem to be buried in ever-increasing work and household
> tasks, I don't see myself doing much of that in the next 6 months or so,
> beyond the improvements that already exist on the side.
>
> -d
>
> On Wed, Aug 17, 2011 at 2:48 AM, Sean Owen <[email protected]> wrote:
>
> > Hi all, I'm again seeing the issue count pile up. I try to run through
> > regularly to resolve anything addressed to me, and even things that
> > aren't but that I am confident enough to fix. It would be great if
> > everyone could do the same in a spare 1-2 hours this week, if only to
> > say "yes, go ahead on that patch" or "no, I don't think this is a good
> > idea". This goes especially for the committers who have not been active
> > in a while.
> >
> > To me, this is the most essential work we can do, because without
> > responses from those with the power to commit, new community members get
> > the message that their contributions are ignored, or that nobody's home.
> > That's no good. Understanding that individuals may not have time to
> > actively write their own new changes and improvements, it seems the
> > least we can all do is engage with and respond to external input, to
> > bring in those who want to make changes.
> >
> > I'd also like to sweep through the issues that have not been touched in
> > 6+ months and close some that just do not seem to be getting any
> > traction or attention. The theory is that closing stuff that by all
> > accounts won't get looked at better communicates what's coming in the
> > project, and focuses attention on issues that might get looked at.
> >
> > Before I start that, though, I would welcome anyone taking a peek at
> > everything that's open and assigning, commenting on, or pinging anything
> > that needs to be kept alive.
> >
>
