[
https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eli Reisman updated GIRAPH-301:
-------------------------------
Attachment: GIRAPH-301-6.patch
This patch has been well tested up to the largest scales we use here, and
performs even better than 301-5. I feel I should explain it, as its methods
might be a bit controversial.
A large data load under our constraints here took a certain 4-figure number of
workers, and over 75 minutes, without locality. With locality, this was reduced
to 20 minutes, with hundreds fewer workers required. With 301-5, it came down
to 15 minutes and 400 fewer workers. Using this patch, the same data load takes
under 4 minutes, with 400 fewer workers than the original job.
The controversial part is this: as stated in earlier posts on this thread,
instrumented runs while experimenting with scale this weekend revealed that
even speeding up data load-in with locality and the other changes in the 301
patches does not end the INPUT_SUPERSTEP as soon as it could, or completely
eliminate the "clumping" effect described in this issue.
The reason for this clumping turns out to be that while ZK can handle large
read throughput, the quorum must sync after writes before servicing a backlog
of many concurrent reads. Since both the FINISHED and RESERVED znode lists are
being queried on every iteration by every worker, and are also being mutated as
splits are claimed and completed, the workers that never get a split are not
sleeping through the input step, but are in fact very, VERY slowly iterating
over their input split list. In some cases, the step ends before they have
finished a single iteration, even when the input superstep goes on for 30 or
more minutes!
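To make the read pressure concrete, here is roughly what each worker's scan
looks like per pass (a minimal sketch in Java; the znode paths are hypothetical
and not the actual Giraph layout):

    // uses org.apache.zookeeper.{ZooKeeper, KeeperException}
    // A worker's scan, per pass: it reads both the FINISHED and the
    // RESERVED state of every split -- two ZK reads per split, per
    // worker, per pass. With thousands of workers, these reads pile
    // up behind every quorum sync that a reserve/finish write forces.
    static void scanSplits(ZooKeeper zk)
        throws KeeperException, InterruptedException {
      for (String split : zk.getChildren("/splits", false)) {
        boolean finished =
            zk.exists("/splits/" + split + "/finished", false) != null;
        boolean reserved =
            zk.exists("/splits/" + split + "/reserved", false) != null;
        // ... decide whether to claim this split or keep scanning ...
      }
    }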
This patch (301-6) dramatically speeds this up by removing the checks for
FINISHED znodes. The znodes are still created whenever a split is finished by a
worker, so that the master knows when to end the barrier and begin the first
calculation superstep. There is no danger of BSP barriers being tampered with.
Further, every worker must read the whole list of splits at least once from
the top, and must find every node RESERVED, before it stops trying to read any
additional splits. Therefore, if a worker dies mid-read, its ephemeral RESERVED
node disappears and another worker can claim the split, since every worker must
still complete one full pass over the list finding all splits RESERVED before
ending its search for good and waiting on the barrier for superstep 0.
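The new loop is roughly the following (a simplified sketch, not the actual
BspServiceWorker#reserveInputSplit() code; the names and znode layout are
hypothetical):

    // imports: org.apache.zookeeper.{ZooKeeper, CreateMode,
    //   KeeperException}, org.apache.zookeeper.ZooDefs.Ids
    // Returns a claimed split, or null once one full pass finds every
    // split RESERVED, at which point the caller waits on the barrier
    // for superstep 0. Callers loop on this after finishing each split.
    static String reserveInputSplit(ZooKeeper zk)
        throws KeeperException, InterruptedException {
      for (String split : zk.getChildren("/splits", false)) {
        try {
          // Ephemeral: if this worker dies mid-read, the reservation
          // vanishes and a later scanner can claim the split.
          zk.create("/splits/" + split + "/reserved", new byte[0],
              Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          return split;  // claimed it
        } catch (KeeperException.NodeExistsException e) {
          // Reserved by another worker; keep scanning. No FINISHED
          // read is needed on this path any more.
        }
      }
      return null;  // every split was RESERVED on this pass
    }

Note that this scan only ever reads the RESERVED side, which is where the
halving of ZK reads during the input phase comes from.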
This means that the only danger of data loss would be if the very last worker
to iterate fails during a split read. In that case, the next superstep will
never come (the split is never marked FINISHED) and the job fails anyway. If a
worker dies in Giraph after marking a split FINISHED, there is currently no
algorithm in place to restore order to the calculation, even if the worker
could restart and recover, so the changes here do no additional harm in this
common failure case.
In actual fact, any worker failing and restarting during the INPUT_SUPERSTEP
currently causes cascading failure of the job. Until we have a more
comprehensive plan for worker failure of this sort, there is no danger
whatsoever in this large optimization to network load and speed during the
input superstep, which comes from having workers decide whether to keep
iterating over the input list based on every split being RESERVED rather than
FINISHED. I have added comments to BspServiceWorker#reserveInputSplit(), where
the changes are coded, noting that should the recovery story for Giraph change
in the future, this algorithmic optimization should be revisited.
Again, I have run this to happy completion many times today and can vouch that
it causes no problems for Giraph as-is. If everyone is comfortable with this
change, I think the reduced network cost (it literally cuts ZK reads from all
workers during the input phase in half) and the reduced time to finish the
superstep are well worth it.
> InputSplit Reservations are clumping, leaving many workers asleep while
> others process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-301
> URL: https://issues.apache.org/jira/browse/GIRAPH-301
> Project: Giraph
> Issue Type: Improvement
> Components: bsp, graph, zookeeper
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Labels: patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch,
> GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch
>
>
> With recent additions to the codebase, users here have noticed many workers
> are able to load input splits extremely quickly, and this has altered the
> behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm
> for split reservations. A few workers process multiple splits (often
> overwhelming Netty and getting GC errors as they attempt to offload too much
> data too quickly) while many (often most) of the others just sleep through
> the superstep, never successfully participating at all.
> Essentially, the current algo is (sketched in code below):
> 1. Scan the input split list, skipping nodes that are marked FINISHED.
> 2. Grab the first unfinished node in the list (reserved or not) and check
> its reserved status.
> 3. If not reserved, attempt to reserve it & return it if successful.
> 4. If the first one you check is already taken, sleep for way too long and
> only wake up if another worker finishes a split, then contend with that
> worker for another split, while the majority of the split list might sit
> idle, not actually checked or claimed by anyone yet.
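> A rough sketch of that loop in Java (hypothetical names and znode paths,
> not the real code):
>
>     // Old reservation logic, simplified: one candidate per pass, then
>     // a long sleep while most of the list goes unexamined.
>     static String reserveInputSplitOld(ZooKeeper zk)
>         throws KeeperException, InterruptedException {
>       for (String split : zk.getChildren("/splits", false)) {
>         if (zk.exists("/splits/" + split + "/finished", false) != null) {
>           continue;  // 1. skip FINISHED splits
>         }
>         // 2.-3. first unfinished split: try to reserve it
>         try {
>           zk.create("/splits/" + split + "/reserved", new byte[0],
>               Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
>           return split;
>         } catch (KeeperException.NodeExistsException e) {
>           // 4. already taken: sleep until another worker finishes a
>           // split, instead of scanning the rest of the list.
>           Thread.sleep(SLEEP_MS);  // SLEEP_MS: hypothetical constant
>           return null;
>         }
>       }
>       return null;
>     }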
> This does not work. By making a few simple changes (and acknowledging that ZK
> reads are cheap, only writes are not) this patch is able to get every worker
> involved, and keep them in the game, ensuring that the INPUT_SUPERSTEP passes
> quickly and painlessly, and without overwhelming Netty by spreading the
> memory load the split readers bear more evenly. If the giraph.splitmb and -w
> options are set correctly, behavior is now exactly as one would expect it to
> be.
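> For example, a hypothetical invocation (the -ca custom-argument flag and
> the exact values shown are assumptions, not part of this patch):
>
>     hadoop jar giraph-jar-with-dependencies.jar \
>       org.apache.giraph.GiraphRunner MyComputation \
>       -w 400 -ca giraph.splitmb=32 ...
>
> Here -w sets the worker count and giraph.splitmb presumably caps the split
> size in MB, so together they determine how the input spreads over workers.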
> This also results in the INPUT_SUPERSTEP passing more quickly, and lets a
> job survive the INPUT_SUPERSTEP for a given data load on fewer Hadoop
> memory slots.
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira