[ 
https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435284#comment-13435284
 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

Hey Alessandro,

Yes, I think on 4 that was the idea, and it makes sense. The problem is, if you 
only check one possibly available node before deciding to sleep, and then start 
reading the split list again from index 0, you contend with the other workers 
whenever one of them finishes a node and wakes you up to keep iterating. Whoever 
gets the first open slot claims it; the rest fail to claim it and go back to 
sleep instead of continuing to iterate.
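
To make that concrete, here is a minimal sketch (Java, not the actual Giraph 
code) of the reservation loop as described above; isFinished(), isReserved(), 
tryReserve() and waitForSplitFinishedEvent() are hypothetical stand-ins for the 
real ZooKeeper plumbing:

  // Sketch of the current reservation algorithm (steps 1-4 from the issue).
  private String reserveInputSplit(List<String> splitList)
      throws InterruptedException {
    while (true) {
      boolean unfinishedSplitsExist = false;
      for (String splitPath : splitList) {     // always restarts from index 0
        if (isFinished(splitPath)) {
          continue;                            // skip splits already read
        }
        unfinishedSplitsExist = true;
        if (!isReserved(splitPath) && tryReserve(splitPath)) {
          return splitPath;                    // claimed a split, go read it
        }
        break;  // first unfinished split was taken: give up for now
      }
      if (!unfinishedSplitsExist) {
        return null;                           // all splits are finished
      }
      waitForSplitFinishedEvent();             // sleep until a worker finishes a split
    }
  }

The break after the first reserved-but-unfinished split is what sends everyone 
back to sleep too eagerly.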

Worse, when each split holds a lot of data and takes a long time to read, the 
other workers time out every minute or so and jump back in to attempt to claim a 
split. Awakened workers iterate again, and since many workers are already reading 
big splits, the first unfinished split they encounter is RESERVED, they fail to 
claim it, and back to sleep they go. So you're back to this problem of everyone 
(including workers who finish a split and try to iterate for a new one) going 
back to sleep far too eagerly. I have seen this behavior no matter how I set 
giraph.splitmb and -w since I started using Giraph, and I have been puzzled why 
I couldn't coax some (often many) workers into doing something when there was 
enough work to go around.

Users here started emailing about this clumping effect, and I had noticed it 
many times over the last few months. The situation I describe above is with the 
new locality patch making some workers read very fast (and overload as they try 
to send out all the data while picking up new splits like crazy), but this 
clumping of split-reading activity, with groups of workers sleeping through the 
whole input phase, has been happening as long as I've been using Giraph.

My cluster is down this morning for upgrades, but I hope to be back up and 
running this afternoon/tonight. The tests I ran before putting the patch up 
worked well: I could get just the behavior I had always expected by doing

 (# of MB of data) / (giraph.splitmb) == (# of workers you should see busy 
right away reading splits, if you select that many or more with -w) 

That is, 1 split per worker right from the get-go. Other manipulations of the 
formula shake out the way one would expect when skewing in favor of extra splits 
or extra workers (e.g. no clumping with 50 workers and 100 splits -- almost all 
workers read 2 splits, rather than some reading 3-4 and some reading 0 like 
before).
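
As a hypothetical worked example (these numbers are made up, purely to 
illustrate the formula):

 10000 MB of input with giraph.splitmb=100  ->  100 splits
 -w 100  ->  every worker claims exactly 1 split right away
 -w 50   ->  every worker reads about 2 splits, none read 0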

So it comes down to your first point: is it bad to load up the ZooKeeper quorum 
with potentially many reads like this? After reading both ZK papers and having 
this problem to think about when I added the locality patch, my opinion is "no": 
this is what ZK is absolutely designed for. Having a quorum of ZK servers to 
split the read requests across definitely helps, but on most clusters this is a 
minimum of 3 servers. This does bear more testing, of course.

The time when slowing or problems can happen is during writes. This patch tries 
to mitigate that at least a bit by not bothering to try to create the claim node 
unless we have a hint that the node is not already created. This will not help 
on the first pass, when everyone is vying for nodes, but it is quite likely to 
after any awakening from sleep since, as of the locality patch, many workers' 
split lists are no longer ordered the same way and they may not encounter the 
same unclaimed nodes right away as they iterate.
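
For reference, the check-before-create pattern looks roughly like this (a 
minimal sketch against the plain ZooKeeper client API, not the actual patch; 
the claimPath and workerInfo arguments are whatever the caller uses to identify 
the split and the worker):

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs.Ids;
  import org.apache.zookeeper.ZooKeeper;

  /** Check-before-create: read first (cheap), write only when the claim looks free. */
  public class SplitClaimSketch {
    static boolean tryClaimSplit(ZooKeeper zk, String claimPath, byte[] workerInfo)
        throws KeeperException, InterruptedException {
      // Cheap read against the quorum: skip the create() if the claim node exists.
      if (zk.exists(claimPath, false) != null) {
        return false;
      }
      try {
        // Write path: only reached when we have a hint the split is unclaimed.
        zk.create(claimPath, workerInfo, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        return true;
      } catch (KeeperException.NodeExistsException e) {
        return false;  // lost the race between exists() and create()
      }
    }
  }

The exists() call costs only a read, so the expensive create() is skipped 
whenever another worker has already left its claim behind.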

                
> InputSplit Reservations are clumping, leaving many workers asleep while others 
> process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch
>
>
> With recent additions to the codebase, users here have noticed many workers 
> are able to load input splits extremely quickly, and this has altered the 
> behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm 
> for split reservations. A few workers process multiple splits (often 
> overwhelming Netty and getting GC errors as they attempt to offload too much 
> data too quickly) while many (often most) of the others just sleep through the 
> superstep, never successfully participating at all.
> Essentially, the current algo is:
> 1. scan the input split list, skipping nodes that are marked "Finished"
> 2. grab the first unfinished node in the list (reserved or not) and check its 
> reserved status.
> 3. if not reserved, attempt to reserve & return it if successful.
> 4. if the first one you check is already taken, sleep for way too long and 
> only wake up if another worker finishes a split, then contend with that 
> worker for another split, while the majority of the split list might sit 
> idle, not actually checked or claimed by anyone yet.
> This does not work. By making a few simple changes (and acknowledging that ZK 
> reads are cheap, only writes are not) this patch is able to get every worker 
> involved, and keep them in the game, ensuring that the INPUT_SUPERSTEP passes 
> quickly and painlessly, and without overwhelming Netty by spreading the 
> memory load the split readers bear more evenly. If the giraph.splitmb and -w 
> options are set correctly, behavior is now exactly as one would expect it to 
> be.
> This also results in INPUT_SUPERSTEP passing more quickly, and lets a given 
> data load survive the INPUT_SUPERSTEP on fewer Hadoop memory slots.
>  
