Stratosphere even employs its own memory management, serializing data into preallocated byte arrays. This not only yields a very compact representation of the data, but also avoids major GC pauses and lets different buffer implementations gracefully spill to disk.
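Roughly, the pattern looks like this (a minimal sketch in Java; the class and method names are made up for illustration, not Stratosphere's actual code):

{code}
import java.io.IOException;
import java.io.OutputStream;

/**
 * Sketch of the preallocated-buffer idea: records are serialized into
 * one big, long-lived byte array instead of millions of small objects,
 * so the GC has almost nothing to trace, and a full buffer can be
 * spilled to disk wholesale.
 */
class SerializedBuffer {
    private final byte[] data;  // single preallocated block
    private int position = 0;

    SerializedBuffer(int capacity) {
        this.data = new byte[capacity];
    }

    /** Appends one (vertexId, targetId) edge; returns false when full. */
    boolean appendEdge(long vertexId, long targetId) {
        if (position + 16 > data.length) {
            return false;  // buffer full: caller should spill and retry
        }
        writeLong(vertexId);
        writeLong(targetId);
        return true;
    }

    private void writeLong(long v) {
        // big-endian, 8 bytes per long
        for (int i = 7; i >= 0; i--) {
            data[position++] = (byte) (v >>> (8 * i));
        }
    }

    /** Writes the used part of the buffer out and resets it for reuse. */
    void spillTo(OutputStream out) throws IOException {
        out.write(data, 0, position);
        position = 0;  // the same array is reused; no new allocation
    }
}
{code}

Since the buffer is reused across spills, the steady-state allocation rate is close to zero, which is what keeps the garbage collector quiet.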
On 17.08.2012 17:17, Claudio Martella wrote:
> Yes, that is definitely the direction you may want to take at a certain
> moment. That is basically what Stanford GPS does as well, and Stratosphere
> too.
>
> On Friday, August 17, 2012, Alessandro Presta wrote:
>
>> I think at that point it would be worth having a new logical place for
>> vertex/edge representation at worker- or partition-level.
>> Avery had some ideas about this.
>>
>> Basically right now we're giving the user the freedom (and responsibility)
>> to choose a representation (both in-memory and for serialization), but
>> another way to go would be to take care of all that at infrastructure
>> level and expose only one Vertex class (where the user only defines the
>> computation details and everything else is abstracted away). Then we could
>> play around with compact representations and even more disruptive
>> strategies (like streaming the graph/messages and re-using objects).
>>
>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <[email protected]>
>> wrote:
>>
>>> I was under the impression that 100k was the upper limit to make things
>>> work without crashing.
>>>
>>> In any case, if one wanted to use a compressed memory representation by
>>> aggregating different edge lists together, could one use the worker
>>> context as a central point of access to the compressed graphs?
>>> I can imagine a vertex class that has only the ID and uses the worker
>>> context to access its edge list (i.e. it is only a client to a central
>>> per-machine repository).
>>> Vertices in the same partition would share this data structure.
>>>
>>> Is there any obvious technical fallacy in this scheme?
>>>
>>> Cheers,
>>> --
>>> Gianmarco
>>>
>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
>>> <[email protected]> wrote:
>>>
>>>> The example where we actually go out of memory was with 500K vertices
>>>> and 500M edges, but yes, as a general rule we should strive to reduce
>>>> our memory footprint in order to push the point where we need to go
>>>> out of core as far away as possible.
>>>>
>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <[email protected]>
>>>> wrote:
>>>>
>>>>> Very interesting.
>>>>>
>>>>> On a side note, a graph with 100k vertices and 100M edges is largish
>>>>> but not that big after all.
>>>>> If it does not fit in 10+ GB of memory, it means that each edge
>>>>> occupies around 100B (amortizing the cost of the vertex over the
>>>>> edges).
>>>>> In my opinion this deserves some thought.
>>>>> If memory is an issue, why not think about compressed memory
>>>>> structures, at least for common graph formats?
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Gianmarco
>>>>>
>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Great metrics, this made a very interesting read, and great code too
>>>>>> as always. This must have been a lot of work. I like the idea of
>>>>>> eliminating the extra temporary storage data structures where
>>>>>> possible, even when not going out-of-core. I think that, plus
>>>>>> avoiding extra object creation during the workflow, can still do a
>>>>>> lot for an in-core job's memory profile. This is looking really good,
>>>>>> and with the config options it's also pluggable depending on your
>>>>>> hardware situation, so it sounds great to me. Great work!
>>>>>>
>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> [ https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437#comment-13435437 ]
>>>>>>>
>>>>>>> Alessandro Presta commented on GIRAPH-249:
>>>>>>> ------------------------------------------
>>>>>>>
>>>>>>> Thanks Claudio, good observation.
>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
>>>>>>>
>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
>>>>>>>
>>>>>>> This is with trunk:
>>>>>>>
>>>>>>> {code}
>>>>>>> hadoop jar giraph-trunk.jar
>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
>>>>>>> -Dgiraph.useN
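Coming back to the scheme Gianmarco describes in the quoted thread (a vertex that holds only its ID and resolves its edge list through a central per-partition repository), a minimal sketch might look like this (all names are hypothetical, not existing Giraph API):

{code}
import java.util.HashMap;
import java.util.Map;

/** Per-partition repository holding all edge lists for that partition. */
class PartitionEdgeStore {
    // A plain map keeps the sketch short; in practice this is where a
    // compressed representation (e.g. sorted, delta-encoded target arrays)
    // would live, shared by every vertex in the partition.
    private final Map<Long, long[]> adjacency = new HashMap<Long, long[]>();

    void setEdges(long vertexId, long[] targets) {
        adjacency.put(vertexId, targets);
    }

    long[] getEdges(long vertexId) {
        return adjacency.get(vertexId);
    }
}

/** A vertex that stores only its ID and fetches edges from the shared store. */
class SlimVertex {
    private final long id;
    private final PartitionEdgeStore store;  // e.g. reached via the worker context

    SlimVertex(long id, PartitionEdgeStore store) {
        this.id = id;
        this.store = store;
    }

    long[] getEdges() {
        return store.getEdges(id);
    }
}
{code}

One caveat with this design: every edge access now goes through the shared structure, so the repository should hand out primitive arrays or reusable iterators rather than allocating per-edge objects, or the object-creation overhead comes back in through the side door.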
