[
https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eli Reisman updated GIRAPH-301:
-------------------------------
Attachment: GIRAPH-301-6.patch
This patch has been well tested up to the largest scales we use here, and
performs even better than 301-5. I feel I should explain it, as its methods
might be a bit controversial.
A large data load under our constraints here took a certain 4-figure number of
workers, and over 75 minutes, without locality. With locality, this was reduced
to 20 minutes, with hundreds fewer workers required. With 301-5, it came down
to 15 minutes and 400 fewer workers. Using this patch, the same data load takes
under 4 minutes, with 400 fewer workers than the original job.
The controversial part is this: as stated in earlier posts on this thread,
instrumented runs while experimenting with scale this weekend revealed that
even speeding up data load-in with locality and the other changes in the 301
patches does not end the INPUT_SUPERSTEP as soon as it could, or completely
eliminate the "clumping" effect described in this issue.
The reason for this clumping turns out to be that while ZK can handle large
read throughput, the quorum must sync after writes before servicing a backlog
of many concurrent reads. Since both the FINISHED and RESERVED znode lists are
being queried on every iteration by every worker, and are also being mutated as
splits are claimed and completed, the workers that never get a split are not
sleeping through the input step, but are in fact very, VERY slowly iterating
over their input split list. In some cases, the step ends before they have
finished a single iteration, even when the input superstep goes on for 30 or
more minutes!
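To make the read pressure concrete, here is roughly what each worker's scan
looks like per pass (a minimal sketch in Java; the znode paths are hypothetical
and not the actual Giraph layout):

    // uses org.apache.zookeeper.{ZooKeeper, KeeperException}
    // A worker's scan, per pass: it reads both the FINISHED and the
    // RESERVED state of every split -- two ZK reads per split, per
    // worker, per pass. With thousands of workers, these reads pile
    // up behind every quorum sync that a reserve/finish write forces.
    static void scanSplits(ZooKeeper zk)
        throws KeeperException, InterruptedException {
      for (String split : zk.getChildren("/splits", false)) {
        boolean finished =
            zk.exists("/splits/" + split + "/finished", false) != null;
        boolean reserved =
            zk.exists("/splits/" + split + "/reserved", false) != null;
        // ... decide whether to claim this split or keep scanning ...
      }
    }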
This patch (301-6) dramatically speeds this up by removing the checks for
FINISHED znodes. The znodes are still created whenever a split is finished by a
worker, so that the master knows when to end the barrier and begin the first
calculation superstep. There is no danger of BSP barriers being tampered with.
Further, every worker must read the whole list of splits at least once from
the top, and must find every node RESERVED, before it stops trying to read any
additional splits. Therefore, if a worker dies mid-read, its ephemeral RESERVED
node disappears and another worker can claim the split, since every worker must
still complete one full pass over the list finding all splits RESERVED before
ending its search for good and waiting on the barrier for superstep 0.
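The new loop is roughly the following (a simplified sketch, not the actual
BspServiceWorker#reserveInputSplit() code; the names and znode layout are
hypothetical):

    // imports: org.apache.zookeeper.{ZooKeeper, CreateMode,
    //   KeeperException}, org.apache.zookeeper.ZooDefs.Ids
    // Returns a claimed split, or null once one full pass finds every
    // split RESERVED, at which point the caller waits on the barrier
    // for superstep 0. Callers loop on this after finishing each split.
    static String reserveInputSplit(ZooKeeper zk)
        throws KeeperException, InterruptedException {
      for (String split : zk.getChildren("/splits", false)) {
        try {
          // Ephemeral: if this worker dies mid-read, the reservation
          // vanishes and a later scanner can claim the split.
          zk.create("/splits/" + split + "/reserved", new byte[0],
              Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          return split;  // claimed it
        } catch (KeeperException.NodeExistsException e) {
          // Reserved by another worker; keep scanning. No FINISHED
          // read is needed on this path any more.
        }
      }
      return null;  // every split was RESERVED on this pass
    }

Note that this scan only ever reads the RESERVED side, which is where the
halving of ZK reads during the input phase comes from.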
This means that the only danger of data loss would be if the very last worker
to iterate fails during a split read. In that case, the next superstep will
never come (the split is never marked FINISHED) and the job fails anyway. If a
worker dies in Giraph after marking a split FINISHED, there is currently no
algorithm in place to restore order to the calculation, even if the worker
could restart and recover, so the changes here do no additional harm in this
common failure case.
In actual fact, any worker failing and restarting during the INPUT_SUPERSTEP
currently causes cascading failure of the job. Until we have a more
comprehensive plan for worker failure of this sort, there is no danger
whatsoever in this large optimization to network load and speed during the
input superstep, which comes from having workers decide whether to keep
iterating over the input list based on every split being RESERVED rather than
FINISHED. I have added comments to BspServiceWorker#reserveInputSplit(), where
the changes are coded, noting that should the recovery story for Giraph change
in the future, this algorithmic optimization should be revisited.
Again, I have run this to happy completion many times today and can vouch that
it causes no problems for Giraph as-is. If everyone is comfortable with this
change, I think the reduced network cost (it literally cuts ZK reads from all
workers during the input phase in half) and the reduced time to finish the
superstep are well worth it.
> InputSplit Reservations are clumping, leaving many workers asleep while
> others process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-301
> URL: https://issues.apache.org/jira/browse/GIRAPH-301
> Project: Giraph
> Issue Type: Improvement
> Components: bsp, graph, zookeeper
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Labels: patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch,
> GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch
>
>
> With recent additions to the codebase, users here have noticed many workers
> are able to load input splits extremely quickly, and this has altered the
> behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm
> for split reservations. A few workers process multiple splits (often
> overwhelming Netty and getting GC errors as they attempt to offload too much
> data too quickly) while many (often most) of the others just sleep through
> the superstep, never successfully participating at all.
> Essentially, the current algo is (sketched in code below):
> 1. Scan the input split list, skipping nodes that are marked FINISHED.
> 2. Grab the first unfinished node in the list (reserved or not) and check
> its reserved status.
> 3. If not reserved, attempt to reserve it & return it if successful.
> 4. If the first one you check is already taken, sleep for way too long and
> only wake up if another worker finishes a split, then contend with that
> worker for another split, while the majority of the split list might sit
> idle, not actually checked or claimed by anyone yet.
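> A rough sketch of that loop in Java (hypothetical names and znode paths,
> not the real code):
>
>     // Old reservation logic, simplified: one candidate per pass, then
>     // a long sleep while most of the list goes unexamined.
>     static String reserveInputSplitOld(ZooKeeper zk)
>         throws KeeperException, InterruptedException {
>       for (String split : zk.getChildren("/splits", false)) {
>         if (zk.exists("/splits/" + split + "/finished", false) != null) {
>           continue;  // 1. skip FINISHED splits
>         }
>         // 2.-3. first unfinished split: try to reserve it
>         try {
>           zk.create("/splits/" + split + "/reserved", new byte[0],
>               Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
>           return split;
>         } catch (KeeperException.NodeExistsException e) {
>           // 4. already taken: sleep until another worker finishes a
>           // split, instead of scanning the rest of the list.
>           Thread.sleep(SLEEP_MS);  // SLEEP_MS: hypothetical constant
>           return null;
>         }
>       }
>       return null;
>     }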
> This does not work. By making a few simple changes (and acknowledging that ZK
> reads are cheap, only writes are not) this patch is able to get every worker
> involved, and keep them in the game, ensuring that the INPUT_SUPERSTEP passes
> quickly and painlessly, and without overwhelming Netty by spreading the
> memory load the split readers bear more evenly. If the giraph.splitmb and -w
> options are set correctly, behavior is now exactly as one would expect it to
> be.
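> For example, a hypothetical invocation (the -ca custom-argument flag and
> the exact values shown are assumptions, not part of this patch):
>
>     hadoop jar giraph-jar-with-dependencies.jar \
>       org.apache.giraph.GiraphRunner MyComputation \
>       -w 400 -ca giraph.splitmb=32 ...
>
> Here -w sets the worker count and giraph.splitmb presumably caps the split
> size in MB, so together they determine how the input spreads over workers.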
> This also results in the INPUT_SUPERSTEP passing more quickly, and lets a
> job survive the INPUT_SUPERSTEP for a given data load on fewer Hadoop
> memory slots.
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira