Hi Maja, I didn't want to minimize your effort which is great, I just wanted to underline the complete overlap between the previous plan and your work. The designed had moved from BTrees+Bloomfilters towards sorted partitions, as in the same discussion (17/Dec/11 15:28) and as presented at the Berlin workshop on the 6th of june (http://prezi.com/ecdgiav4oeex/bb-apache-giraph-distributed-graph-processing-in-the-cloud/). It really looks to me like your contribution is one of the best ways to tackle the problem.
Best, Claudio On Wed, Aug 1, 2012 at 8:43 PM, Maja Kabiljo <[email protected]> wrote: > Another thing, I think I should explain what from GIRAPH-45 discussion am > I actually using here, since I don't use bloomfilters and BTrees. The way > it works is the following: > - Inside the outer message store we have message stores for each of the > partitions separately. > - Partition message stores keep data in ordered map (ordered by vertex id). > - In outer messages store we check if we should flush something (do we > have more than allowed number of messages in memory). While we do, we > flush the partition with largest number of messages in memory. > - When partition messages store is flushed, all the data is written to a > file in the order of vertex ids, file content is like: > num_vertices > vertex_1_id num_messages_1 message_1_1 message_1_2 ... > vertex_2_id num_messages_2 message_2_1 message_2_2 ... > ... > - In the end each partition will have some messages in memory, and N > files, where N is the number of times it was flushed. > - When it's time to do the computation, within a single partition we call > compute methods in order of vertex ids. > - We use buffered streams and read data from all partition files > sequentially, since we'll need data in the same order it's written in each > of the files. This way we limit number of random file accesses. -- Claudio Martella [email protected]
