The internal state of pagerank is just the rank vector which would be 100k times 1.
--sebastian Am 17.06.2014 17:56 schrieb "Yingjun Wu" <[email protected]>: > Hi Stephan, > > Thanks for your quick reply. > > Let's just consider a simple iterative processing algorithm, say, pagerank. > If we wanna compute the pagerank value for 100k webpages, then the internal > state, if represent the graph as matrix, should be at least 100k * 100k * 8 > bytes=74GB, which obviously exceeds the memory size of a single commodity > machine. GraphLab does not suffer from this problem as it stores the graph > in distributed memory. So in this case, do you think it is necessary to > distributed the internal state to multiple machines? > > Regards, > Yingjun > > > On Tue, Jun 17, 2014 at 9:27 PM, Stephan Ewen <[email protected]> wrote: > > > Hi Yingjun! > > > > Thanks for pointing this out. I would like to learn a bit more about the > > problem you have, so let me ask you a few questions to make sure I > > understand the matter in more detail... > > > > I assume you are thinking abut programs that run a single reduce function > > in the end, "all reduce" , rather than a "by-key-reduce"? And the size of > > the data in that "all-reduce" is too large for the main memory? > > > > In many cases, you can do a "combine" step (like a pre-reduce) before the > > final reduce function. Because there are many parallel instances of the > > "combine", that one has much more memory available, and is often able to > > shrink the size of the data such that the final reduce function works on > a > > data set small enough to fit in memory. In flink, you get that "combine" > > step automatically when you use the ReduceFunction, or when you annotate > a > > GroupReduceFunction with "@Combinable". > > > > Sometimes, such a "combine" behavior cannot be applied to a reduce > > function, when an operation cannot be executed on a sub-part of the data. > > The case you are mentioning is such a case? > > > > We have not looked into maintaining a distributed state that can be > > accessed across machines. Sometimes that may be helpful, I agree. It > would > > mean that some data accesses have quite a latency, if they cross the > > network, but that may be acceptable for certain applications. How would > you > > imagine such a feature to look like? Could you give us an example of what > > you would design it like? > > > > Greetings, > > Stephan > > >
