[
https://issues.apache.org/jira/browse/GIRAPH-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488111#comment-13488111
]
Eli Reisman commented on GIRAPH-388:
------------------------------------
Exactly. The idea of splitting hubs and neighborhoods addresses the same kind
of problem I described above about message duplication, where a supernode
belongs to a given partition on one worker and many vertices on other workers
have out-edges to that supernode. The lumpiness of social graph data makes
Giraph behave very differently than it does when running benchmarks configured
to the same input data size.
I also agree that edge-based partitioning is a good idea for balancing social
graph data; it already came in really handy for me last summer while working
on the input superstep. That was also a flushing issue: measuring outgoing
graph partition data by # of vertices per flush rather than # of edges
resulted in workers crashing when they read or were assigned a supernode or
two and tried to read/write them to the wire. An outgoing buffer with a
supernode in it (and its many, many out-edges) was so much bigger than a
buffer of typical-sized vertices that it was crashing the IPC. Tuning the
flushing there was critical to scaling Giraph up under the memory constraints
I was trying to meet. The GIRAPH-232 metrics with Graphite graphs were very
illustrative of how differently the benchmark and social data made the
framework behave as a job ran.
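To make the vertices-vs-edges flushing point concrete, here is a rough sketch
of a buffer that flushes on pending out-edge count instead of vertex count.
This is not the actual Giraph code: EdgeAwareFlushBuffer, VertexStub, and
maxEdgesPerFlush are hypothetical names made up for illustration, and the
"flush" only records how many vertices would have gone out in each batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not a Giraph class: flush by edge count, not vertex count. */
final class EdgeAwareFlushBuffer {

    /** Minimal stand-in for a vertex: an id plus the number of its out-edges. */
    public static final class VertexStub {
        final long id;
        final int numOutEdges;

        public VertexStub(long id, int numOutEdges) {
            this.id = id;
            this.numOutEdges = numOutEdges;
        }
    }

    private final long maxEdgesPerFlush;           // assumed tuning knob
    private final List<VertexStub> pending = new ArrayList<>();
    private final List<Integer> flushSizes = new ArrayList<>();
    private long pendingEdges = 0;

    public EdgeAwareFlushBuffer(long maxEdgesPerFlush) {
        this.maxEdgesPerFlush = maxEdgesPerFlush;
    }

    /** Buffers a vertex, flushing around it whenever the edge budget is hit. */
    public void add(VertexStub v) {
        // Flush what we already hold first, so a supernode travels alone
        // instead of riding on top of an already partly filled buffer.
        if (!pending.isEmpty() && pendingEdges + v.numOutEdges > maxEdgesPerFlush) {
            flush();
        }
        pending.add(v);
        pendingEdges += v.numOutEdges;
        if (pendingEdges >= maxEdgesPerFlush) {
            flush();
        }
    }

    /** Stand-in for writing to the wire: records the batch size and resets. */
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushSizes.add(pending.size());
        pending.clear();
        pendingEdges = 0;
    }

    /** Number of vertices sent in each flush so far, in order. */
    public List<Integer> getFlushSizes() {
        return flushSizes;
    }
}
```

With a vertex-count threshold, a batch holding one supernode can be orders of
magnitude larger than a batch of typical vertices; bounding by edges keeps the
batch sizes roughly comparable, at the cost of some very small batches.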
As you said before, messaging is a different situation. If you think the
flushing and/or deduplication isn't going to help save memory per-worker, I'm
happy to shift focus to where the good solutions are. If you think the
deduplication issue can be addressed better another way, that sounds good too.
I'd love to see more ideas (and more fleshed-out ideas) on the mailing list
about how some of you who know a lot about this subject but don't have a lot of
time to code up an example would attack these problems. There are a number of
us who are happy to try to code up a good idea, and not afraid to go down a
blind alley with you to see if something works. Many graph tools I've reviewed
for ideas seem top-to-bottom optimized for particular uses. Giraph is a more
general framework. Are there some existing solutions you've seen out there we
should be looking at or emulating to solve some of these problems?
> Improve the way we keep outgoing messages
> -----------------------------------------
>
> Key: GIRAPH-388
> URL: https://issues.apache.org/jira/browse/GIRAPH-388
> Project: Giraph
> Issue Type: Improvement
> Reporter: Maja Kabiljo
> Assignee: Maja Kabiljo
> Attachments: GIRAPH-388.patch
>
>
> As per discussion on GIRAPH-357, in a standard application the chances that
> we get to use the client-side combiner are very low. I experimented with the
> benefits we can get from not having the client-side combiner at all. It turns
> out that having a lot of maps in SendMessageCache, and then a collection
> inside each of them, really hurts performance.