Here are my thoughts so far: http://dl.dropbox.com/u/36863361/sd-2.pdf
and tex source: http://dl.dropbox.com/u/36863361/sd-2.tex

I think that this gets rid of the QR steps. I am still debugging the case of a singular matrix, but that shouldn't apply to any real cases.
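[Editor's note: for readers following along, one common way to drop the explicit tall-skinny QR in a randomized SVD is to orthonormalize Y = A * Omega through its small Gram matrix (a "Cholesky QR"). The sketch below, in Python/NumPy, only illustrates that general idea; it is not necessarily the construction in the linked notes, and the function and parameter names are made up. Note that the Cholesky step is exactly what breaks down when Y is rank-deficient, which echoes the singular-matrix caveat above.]

```python
import numpy as np

def randomized_svd_no_qr(A, k, p=10, seed=0):
    # Randomized SVD where the tall-skinny QR of Y = A @ Omega is replaced
    # by a Cholesky factorization of the small Gram matrix Y^T Y.
    # Illustration only; not the method from the linked notes.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))
    Y = A @ Omega                      # m x (k+p) sample of the range of A
    G = Y.T @ Y                        # small (k+p) x (k+p) Gram matrix
    L = np.linalg.cholesky(G)          # G = L L^T; fails if Y is rank-deficient
    Q = np.linalg.solve(L, Y.T).T      # Q = Y L^{-T} has orthonormal columns
    B = Q.T @ A                        # (k+p) x n projected problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# Example call; dimensions are arbitrary.
A = np.random.default_rng(1).standard_normal((500, 80))
U, s, Vt = randomized_svd_no_qr(A, k=10)
```

The only large intermediates here are A @ Omega and Q.T @ A; everything that gets factored is (k+p) x (k+p).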
On Wed, Aug 17, 2011 at 12:01 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I will take a look, although there seems to be a lot of new stuff I don't
> have time to read the science for.
>
> On top of it, I was planning some improvements on SSVD scaling and getting
> rid of current limitations for some time now, such as:
>
> -- SSVD-wide enhancements: to allow better wide scaling, in summary to
> billions of non-zero elements per row:
>    -- remove the at-least-(k+p)-rows-per-map-task limitation without
>    causing "supersplits" by allowing blocked QR pushdown to reducers (or
>    perhaps even automatic pushdown; I am not sure if it is possible).
>    -- I have already used SSVD code that equips the vector with a
>    preprocessor via the Configured Hadoop interface, allowing on-the-fly
>    random projection, which makes it possible to randomly project very
>    long rows without ever loading them in memory.
>
> -- "SSVD-tall" improvements: to allow more vertical scaling (currently
> thought to top out at about a billion rows with a lot of memory) by
> introducing more bottom-up divide-and-conquer QR steps in the middle.
>
> Unfortunately, I see most of those improvements (except probably for the
> preprocessor improvement, and perhaps the QR pushdown) as a purely
> theoretical challenge, as I have yet to find a use case for them either
> myself or in public; hence it is merely a theoretical scale interest right
> now. A dense matrix even of a million by a million is already a 5 to 8 TB
> input file, which is a challenge for me to find, much less benchmark on a
> thousand-node cluster, and this case is thought to be already well covered
> even by the current code. A potential challenge to it is high deviation in
> the number of non-zero elements per input row (so that it may be a million
> on average with spikes to a billion or so, which would mean an 8 GB-sized
> vector).
>
> Given that I seem to be buried in ever-increasing work and household
> tasks, I don't see myself doing much of that, except for what improvements
> already exist on the side, in the next 6 months or so.
>
> -d
>
> On Wed, Aug 17, 2011 at 2:48 AM, Sean Owen <[email protected]> wrote:
>
> > Hi all, I'm again seeing the issue count tend to pile up. I try to run
> > through regularly to resolve anything addressed to me, and even things
> > that aren't but that I am confident enough to fix. It would be great if
> > everyone could do the same in a spare 1-2 hours this week, if only to
> > say "yes, go ahead on that patch" or "no, I don't think this is a good
> > idea". Especially the committers who have not been active in a while.
> >
> > To me, this is the most essential work we can do, because without
> > responses from those with the power to commit, new community members get
> > the message that their contributions are ignored, or that nobody's home.
> > That's no good. Understanding that individuals may not have time to
> > actively write their own new changes and improvements, it seems that the
> > least we can all do is involve and respond to external input, to bring
> > in those who want to make changes.
> >
> > I'd also like to sweep through the issues that have not been touched in
> > 6+ months and close some that just do not seem to be getting any
> > traction or attention.
> > The theory is that closing stuff that by all accounts won't get looked
> > at better communicates what's coming in the project, and focuses
> > attention on issues that might get looked at.
> >
> > Before I start that, though, I would welcome anyone to peek at
> > everything that's open and assign, comment, ping, etc. anything that
> > needs to be kept alive.
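[Editor's note: as background on the "blocked QR pushdown" and "bottom-up divide-and-conquer QR" that Dmitriy mentions above, here is a minimal two-level tall-and-skinny QR (TSQR) sketch in Python/NumPy. It only illustrates the merge idea: each row block (think one map task or reducer) computes a local QR, and only the small R factors are combined. The actual Mahout SSVD job organization is more involved and is not reproduced here.]

```python
import numpy as np

def tsqr_two_level(blocks):
    # Two-level "tall-and-skinny" QR over row blocks: local QR per block,
    # then a second QR over the stacked small R factors.  Each block must
    # have at least as many rows as columns so its reduced R is square.
    local = [np.linalg.qr(b) for b in blocks]          # (Q_i, R_i) per block
    stacked_R = np.vstack([R_i for _, R_i in local])   # small: (#blocks*n) x n
    Q2, R = np.linalg.qr(stacked_R)                     # merge step
    n = R.shape[0]
    Q = np.vstack([Q_i @ Q2[i * n:(i + 1) * n]          # stitch global Q back
                   for i, (Q_i, _) in enumerate(local)])
    return Q, R

# Sanity check: 8 blocks of a 1000 x 20 matrix reproduce a valid QR of the whole.
A = np.random.default_rng(1).standard_normal((1000, 20))
Q, R = tsqr_two_level(np.array_split(A, 8))
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(20)))
```

Only the n x n R factors travel between levels, which is what makes pushing the merge down to reducers attractive for very tall inputs.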

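[Editor's note: the "preprocessor" point above, projecting very long rows without ever loading them in memory, can be illustrated with a toy streaming random projection: the row of Omega needed for each incoming non-zero is regenerated from a seed derived from its column index, so neither the dense input row nor Omega is ever materialized. This is a hypothetical sketch, not the Configured-based hook in the actual SSVD code; the function and parameter names are made up.]

```python
import numpy as np

def project_row_streaming(nonzeros, k_plus_p, seed=42):
    # Accumulate y = x * Omega for one sparse row, one non-zero at a time.
    # 'nonzeros' is any iterable of (column_index, value) pairs, so the row
    # can be streamed from disk instead of being held in memory.
    y = np.zeros(k_plus_p)
    for col, value in nonzeros:
        # Regenerate the needed row of Omega from (seed, column index);
        # the full n x (k+p) Omega matrix is never built.
        omega_row = np.random.default_rng([seed, col]).standard_normal(k_plus_p)
        y += value * omega_row
    return y

# Tiny usage example: project a "row" with three non-zeros into 5 dimensions.
print(project_row_streaming([(3, 1.0), (1_000_000, 2.5), (7, -0.5)], k_plus_p=5))
```

Because the Omega rows are regenerated deterministically from the column index, every row of the input sees the same projection, which is what keeps the result consistent across map tasks.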