OK. Let's take that as read, then.

On Mon, Dec 13, 2010 at 11:27 AM, Jake Mannix <[email protected]> wrote:
> Ted, there are some vectors where it certainly matters: the eigenvectors of a
> really big matrix are dense, and thus take O(8*num_vertices) bytes to hold just
> one of them in memory. Doing something sequential with these can certainly
> make sense, and in some cases is actually necessary, esp. if done in the
> mappers or reducers where there is less memory than you usually have...
>
> -jake
>
> On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[email protected]> wrote:
>
> > I really don't see this as a big deal even with crazy big vectors.
> >
> > Looking at web scale, for instance, the most linked Wikipedia article only
> > has 10 million in-links or so. On the web, the most massive web site is
> > unlikely to have >100 million in-links. Both of these fit in very modest
> > amounts of memory.
> >
> > Where's the rub?
> >
> > On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > Jake,
> > > No, I was trying exactly what you were proposing some time ago on the
> > > list. I am trying to make long vectors not occupy a lot of memory.
> > >
> > > E.g. a 1m-long dense vector would require 8MB just to load it. And I am
> > > saying, hey, there are a lot of sequential techniques that can provide a
> > > handler that would inspect the vector element by element without having
> > > to preallocate 8MB.
> > >
> > > For 1 million-long vectors it isn't too scary, but it starts being so
> > > for default Hadoop memory settings in the area of 50-100MB (or 5-10
> > > million non-zero elements). Stochastic SVD will survive that, but it
> > > means less memory for blocking, and the more blocks you have, the more
> > > CPU it requires (although CPU demand is only linear in the number of
> > > blocks, and only in a significantly smaller part of the computation, so
> > > only an insignificant part of total CPU flops depends on the # of
> > > blocks; but there is a part that does, still).
> > >
> > > Like I said, it also would address the case when rows don't fit in
> > > memory (hence no memory bound for n of A), but the most immediate
> > > benefit is to the speed/scalability/memory requirements of SSVD in most
> > > practical LSI cases.
> > >
> > > -Dmitriy
> > >
> > > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[email protected]> wrote:
> > >
> > > > Hey Dmitriy,
> > > >
> > > > I've also been playing around with a VectorWritable format which is
> > > > backed by a SequenceFile, but I've been focussed on the case where
> > > > it's essentially the entire matrix, and the rows don't fit into
> > > > memory. This seems different than your current use case, however -
> > > > you just want (relatively) small vectors to load faster, right?
> > > >
> > > > -jake
> > > >
> > > > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > > Interesting idea.
> > > > >
> > > > > Would this introduce a new vector type that only allows iterating
> > > > > through the elements once?
> > > > >
> > > > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to submit a patch to VectorWritable that allows for
> > > > > > streaming access to vector elements without having to prebuffer
> > > > > > all of them first. (The current code allows for the latter only.)
> > > > > >
> > > > > > That patch would eliminate one of the memory usage issues in the
> > > > > > current Stochastic SVD implementation and effectively lift the
> > > > > > memory bound on n for the SVD work. (The value I see is not so
> > > > > > much in lifting the bound, though, but in being more efficient in
> > > > > > memory use, thus essentially speeding up the computation.)
> > > > > >
> > > > > > If it's OK, I would like to create a JIRA issue and provide a
> > > > > > patch for it.
> > > > > >
> > > > > > Another issue is to provide an SSVD patch that depends on that
> > > > > > patch for VectorWritable.
> > > > > >
> > > > > > Thank you.
> > > > > > -Dmitriy
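
For illustration, here is a minimal sketch of the element-by-element access proposed in the thread above. It is not the actual patch and not Mahout's API. It assumes a hypothetical dense layout (an int length followed by one 8-byte double per element), which is not claimed to be VectorWritable's real wire format. The names StreamingVectorSketch, ElementHandler, writeDense, forEachElement and sumOfSquares are all invented for this example. The reader hands each element to a callback as it is read, so a single pass needs constant memory rather than the roughly 8 bytes per element that a preallocated double[] would take.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch: none of these names come from Mahout or Hadoop.
class StreamingVectorSketch {

    /** Callback invoked once per element, in index order. */
    interface ElementHandler {
        void onElement(int index, double value);
    }

    /** Serializes a dense vector as: int length, then 8 bytes per element (assumed layout). */
    static byte[] writeDense(double[] values) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads elements one at a time and passes each to the handler; no double[] is allocated. */
    static int forEachElement(InputStream in, ElementHandler handler) {
        try {
            DataInputStream data = new DataInputStream(in);
            int length = data.readInt();
            for (int i = 0; i < length; i++) {
                handler.onElement(i, data.readDouble());
            }
            return length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Example single-pass computation: the squared Euclidean norm. */
    static double sumOfSquares(InputStream in) {
        double[] acc = new double[1]; // one-slot accumulator, independent of vector length
        forEachElement(in, (index, value) -> acc[0] += value * value);
        return acc[0];
    }

    public static void main(String[] args) {
        byte[] bytes = writeDense(new double[] {3.0, 0.0, 4.0});
        System.out.println(sumOfSquares(new ByteArrayInputStream(bytes))); // prints 25.0
    }
}
```

As with the single-iteration vector type raised in the thread, the stream here can be consumed only once: a computation that needs a second pass would have to re-open the input or buffer the elements after all.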
