Agreed, we should be (and will be) moving further in this direction (byte
arrays / ByteBuffer) in the future for Giraph.
On 8/17/12 8:36 AM, Sebastian Schelter wrote:
Stratosphere even employs its own memory management by serializing data
to preallocated byte arrays. This not only allows for a very compact
representation of the data, but also avoids major GC pauses and allows
different buffer implementations to gracefully spill to disk.
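To make that concrete, here is a minimal, hypothetical sketch (not actual
Stratosphere or Giraph code) of the idea: records are serialized into one
preallocated byte array, so the JVM sees a single large buffer instead of
millions of small objects, and a full buffer can simply be handed off to disk.
{code}
// Hypothetical sketch only: append (targetId, weight) edges into one
// preallocated byte[] instead of keeping per-edge objects on the heap.
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

public class SerializedEdgeBuffer {
  private final byte[] buffer;  // allocated once, reused across supersteps
  private int position = 0;

  public SerializedEdgeBuffer(int capacityBytes) {
    buffer = new byte[capacityBytes];
  }

  /** Appends one edge (8-byte id + 4-byte weight); returns false when full. */
  public boolean append(long targetId, float weight) {
    if (position + 12 > buffer.length) {
      return false;  // caller could spill the full buffer to disk here
    }
    putLong(targetId);
    putInt(Float.floatToIntBits(weight));
    return true;
  }

  /** Streams the edges back without materializing per-edge objects. */
  public DataInputStream reader() {
    return new DataInputStream(new ByteArrayInputStream(buffer, 0, position));
  }

  private void putLong(long v) {
    for (int i = 7; i >= 0; i--) {
      buffer[position++] = (byte) (v >>> (8 * i));
    }
  }

  private void putInt(int v) {
    for (int i = 3; i >= 0; i--) {
      buffer[position++] = (byte) (v >>> (8 * i));
    }
  }
}
{code}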
On 17.08.2012 17:17, Claudio Martella wrote:
Yes, that is definitely the direction you may want to take at some point.
That is basically what Stanford GPS does as well, and Stratosphere too.
On Friday, August 17, 2012, Alessandro Presta wrote:
I think at that point it would be worth having a new logical place for
vertex/edge representation at worker- or partition-level.
Avery had some ideas about this.
Basically, right now we're giving the user the freedom (and responsibility)
to choose a representation (both in-memory and for serialization), but
another way to go would be to take care of all of that at the infrastructure
level and expose only one Vertex class (where the user only defines the
computation details and everything else is abstracted away). Then we could
play around with compact representations and even more disruptive
strategies (like streaming the graph/messages and re-using objects).
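For what it's worth, here is a rough sketch of what such a single user-facing
class might look like; the names and signatures below are made up for
illustration and are not the actual Giraph API.
{code}
// Illustrative only: the user subclasses one Vertex and implements compute();
// how ids, values and edges are actually stored (objects, packed byte arrays,
// out-of-core, ...) is decided entirely by the infrastructure.
public abstract class Vertex<I, V, E, M> {
  // Accessors backed by whatever representation the worker/partition chooses.
  public abstract I getId();
  public abstract V getValue();
  public abstract void setValue(V value);
  public abstract Iterable<E> getEdges();
  public abstract void sendMessage(I target, M message);
  public abstract void voteToHalt();

  // The only thing the user writes: the per-superstep computation.
  public abstract void compute(Iterable<M> messages);
}
{code}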
On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <[email protected]> wrote:
I was under the impression that 100k was the upper limit to make things
work without crashing.
In any case, if one wanted to use a compressed memory representation by
aggregating different edge lists together, could one use the worker context
as a central point of access to the compressed graphs?
I can imagine a vertex class that has only the ID and uses the worker
context to access its edge list (i.e. it is only a client to a central
per-machine repository). Vertices in the same partition would share this
data structure.
Is there any obvious technical flaw in this scheme?
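Something like the following (purely illustrative, all names made up) is
roughly what I have in mind:
{code}
// Purely illustrative sketch of the scheme above: vertices keep only their id
// and delegate edge access to a shared, per-worker repository that could store
// the edge lists in a compressed form internally.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedEdgeRepository {
  // One instance per worker (e.g. reachable through the worker context);
  // internally this could delta-encode or otherwise compress the edge lists.
  private final Map<Long, long[]> edgeLists = new ConcurrentHashMap<>();

  void putEdges(long vertexId, long[] targets) {
    edgeLists.put(vertexId, targets);
  }

  long[] getEdges(long vertexId) {
    long[] targets = edgeLists.get(vertexId);
    return targets != null ? targets : new long[0];
  }
}

class LightweightVertex {
  private final long id;
  private final SharedEdgeRepository repository;  // obtained from the worker context

  LightweightVertex(long id, SharedEdgeRepository repository) {
    this.id = id;
    this.repository = repository;
  }

  long[] edges() {
    return repository.getEdges(id);  // client of the central per-machine store
  }
}
{code}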
Cheers,
--
Gianmarco
On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta <[email protected]> wrote:
The example where we actually go out of memory was with 500K vertices and
500M edges, but yes, as a general rule we should strive to reduce our
memory footprint in order to push the point where we need to go out of
core as far away as possible.
On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <[email protected]>
wrote:
Very interesting.
On a side note, a graph with 100k vertices and 100M edges is largish but
not that big after all.
If it does not fit in 10+ GB of memory, it means that each edge occupies
around 100 bytes (amortizing the cost of the vertex over the edges).
In my opinion this deserves some thought.
If memory is an issue, why not think about compressed memory structures,
at least for common graph formats?
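To spell out that back-of-the-envelope figure (assuming roughly 10 GB of
aggregate heap, as above; the exact numbers are only illustrative):
{code}
// Rough arithmetic behind the ~100 bytes/edge estimate above.
public class BytesPerEdge {
  public static void main(String[] args) {
    long edges = 100_000_000L;                   // 100M edges
    long heapBytes = 10L * 1024 * 1024 * 1024;   // ~10 GB of memory
    System.out.println(heapBytes / edges + " bytes/edge");  // prints "107 bytes/edge"
  }
}
{code}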
Cheers,
--
Gianmarco
On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman <[email protected]> wrote:
Great metrics, this made a very interesting read, and great code too, as
always. This must have been a lot of work. I like the idea of eliminating
the extra temporary storage data structures where possible, even when not
going out-of-core. I think that, plus avoiding extra object creation during
the workflow, can still do a lot for an in-core job's memory profile. This
is looking really good, and it sounds like with the config options it's also
pluggable depending on your hardware situation, so it sounds great to me.
Great work!
On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA) <[email protected]> wrote:
[ https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437#comment-13435437 ]
Alessandro Presta commented on GIRAPH-249:
------------------------------------------
Thanks Claudio, good observation.
You got me curious, so I quickly ran a shortest paths benchmark.
500k vertices, 100 edges/vertex, 10 workers
This is with trunk:
{code}
hadoop jar giraph-trunk.jar
org.apache.giraph.benchmark.ShortestPathsBenchmark
-Dgiraph.useN