Hi Claudio,

It's fine, I wasn't trying to take credit for the things you and others came up with; I'm sorry if it sounded that way. Since there were a lot of great ideas in the discussion, I just wanted to make clear which of them I'm using and which I'm not, so one wouldn't have to go through the whole code to find that out.
Maja

On 8/24/12 7:23 PM, "Claudio Martella" <[email protected]> wrote:

>Hi Maja,
>
>I didn't want to minimize your effort, which is great; I just wanted to
>underline the complete overlap between the previous plan and your work.
>The design had moved from BTrees + Bloom filters towards sorted
>partitions, as in the same discussion (17/Dec/11 15:28) and as presented
>at the Berlin workshop on the 6th of June
>(http://prezi.com/ecdgiav4oeex/bb-apache-giraph-distributed-graph-processing-in-the-cloud/).
>It really looks to me like your contribution is one of the best ways to
>tackle the problem.
>
>Best,
>Claudio
>
>On Wed, Aug 1, 2012 at 8:43 PM, Maja Kabiljo <[email protected]> wrote:
>> Another thing, I think I should explain what from the GIRAPH-45
>> discussion I am actually using here, since I don't use Bloom filters
>> and BTrees. The way it works is the following:
>> - Inside the outer message store we have a message store for each of
>>   the partitions separately.
>> - Partition message stores keep data in an ordered map (ordered by
>>   vertex id).
>> - In the outer message store we check whether we should flush something
>>   (do we have more than the allowed number of messages in memory).
>>   While we do, we flush the partition with the largest number of
>>   messages in memory.
>> - When a partition message store is flushed, all of its data is written
>>   to a file in the order of vertex ids; the file content looks like:
>>   num_vertices
>>   vertex_1_id num_messages_1 message_1_1 message_1_2 ...
>>   vertex_2_id num_messages_2 message_2_1 message_2_2 ...
>>   ...
>> - In the end each partition will have some messages in memory and N
>>   files, where N is the number of times it was flushed.
>> - When it's time to do the computation, within a single partition we
>>   call the compute methods in order of vertex ids.
>> - We use buffered streams and read data from all partition files
>>   sequentially, since we need the data in the same order it was written
>>   in each of the files. This way we limit the number of random file
>>   accesses.
>
>--
>    Claudio Martella
>    [email protected]
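To make the quoted description a bit more concrete, here is a rough Java sketch of one partition's message store and of the forward-only reader used at compute time. This is not the actual patch code: all class and method names are made up, vertex ids are hard-coded as longs and messages as doubles purely for illustration, and the outer store (which tracks total memory usage and picks the fullest partition to flush) is left out.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionMessageStoreSketch {
  // Messages currently held in memory, keyed and ordered by vertex id,
  // so flushes and reads both see vertices in the same order.
  private final TreeMap<Long, List<Double>> inMemory = new TreeMap<>();
  // One file per flush of this partition.
  private final List<File> flushedFiles = new ArrayList<>();
  private int messagesInMemory = 0;

  public void addMessage(long vertexId, double message) {
    inMemory.computeIfAbsent(vertexId, id -> new ArrayList<>()).add(message);
    messagesInMemory++;
  }

  // The outer store would use this to pick the fullest partition to flush.
  public int getMessagesInMemory() {
    return messagesInMemory;
  }

  /** Spill everything currently in memory to a new file, in vertex id order. */
  public void flush() throws IOException {
    File file = File.createTempFile("partition-messages-", ".bin");
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)))) {
      // Layout: num_vertices, then per vertex:
      //   vertex_id num_messages message_1 message_2 ...
      out.writeInt(inMemory.size());
      for (Map.Entry<Long, List<Double>> e : inMemory.entrySet()) {
        out.writeLong(e.getKey());
        out.writeInt(e.getValue().size());
        for (double m : e.getValue()) {
          out.writeDouble(m);
        }
      }
    }
    flushedFiles.add(file);
    inMemory.clear();
    messagesInMemory = 0;
  }

  /** Forward-only, buffered reader over one flushed file. */
  public static class SequentialReader implements AutoCloseable {
    private final DataInputStream in;
    private int verticesLeft;
    private long nextVertexId;
    private int nextMessageCount;

    public SequentialReader(File file) throws IOException {
      in = new DataInputStream(
          new BufferedInputStream(new FileInputStream(file)));
      verticesLeft = in.readInt();
      advance();
    }

    private void advance() throws IOException {
      if (verticesLeft-- > 0) {
        nextVertexId = in.readLong();
        nextMessageCount = in.readInt();
      } else {
        nextVertexId = Long.MAX_VALUE;  // this file is exhausted
      }
    }

    // Vertices are computed in increasing id order and the file was written
    // in that order, so we only ever need to look at the next record.
    public void appendMessagesFor(long vertexId, List<Double> messages)
        throws IOException {
      if (nextVertexId == vertexId) {
        for (int i = 0; i < nextMessageCount; i++) {
          messages.add(in.readDouble());
        }
        advance();
      }
    }

    @Override
    public void close() throws IOException {
      in.close();
    }
  }
}

At compute time one SequentialReader would be opened per flushed file of the partition, and for each vertex id (visited in increasing order) the messages are gathered from the in-memory map plus appendMessagesFor() on every reader. Because every file was written in that same order, each reader only ever moves forward, which is what keeps the number of random file accesses down.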
