[GitHub] giraph issue #96: GIRAPH-1213: Fix issues with network requests retries and ...
Github user majakabiljo commented on the issue: https://github.com/apache/giraph/pull/96 I used a pipeline which runs 100 jobs and was always getting at least a few jobs stuck with open network requests. Running it with more logging helped identify these two issues, and after the change it was 100% successful. ---
[GitHub] giraph pull request #96: GIRAPH-1213: Fix issues with network requests retri...
Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/96#discussion_r240745235 --- Diff: giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java --- @@ -1147,8 +1158,11 @@ private void checkRequestsAfterChannelFailure(final Channel channel) { resendRequestsWhenNeeded(new Predicate() { @Override public boolean apply(RequestInfo requestInfo) { -return requestInfo.getDestinationAddress().equals( -channel.remoteAddress()); +if (requestInfo.getWriteFuture() == null || --- End diff -- It can happen if the request wasn't sent out yet, not sure if there is some other scenario. ---
[GitHub] giraph pull request #97: GIRAPH-1215: Make FixedCapacityHeaps work with 0 ca...
GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/97 GIRAPH-1215: Make FixedCapacityHeaps work with 0 capacity Currently FixedCapacityHeaps throw an exception when they are used with capacity 0, this fixes it. Most of the diff is autogenerated, only change in FixedCapacityType2TypeMinHeap.java You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph giraph-1215 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/97.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #97 commit 19b7b035dab9a04c79ebb46149ab07366751895a Author: Maja Kabiljo Date: 2018-12-05T17:54:55Z GIRAPH-1215: Make FixedCapacityHeaps work with 0 capacity Currently FixedCapacityHeaps throw an exception when they are used with capacity 0, this fixes it. ---
[GitHub] giraph pull request #96: GIRAPH-1213: Fix issues with network requests retri...
GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/96 GIRAPH-1213: Fix issues with network requests retries and add more logging Fixing two bugs: - When channel fails, we are currently retrying all requests towards the destination machine from the channel, instead of just ones which are happening on the concrete channel. - In practice, we've noticed BlockingOperationException can get thrown when we wait to connect on channel in which case we silently don't send the request we are trying to send, so catching this exception and retrying instead. Also added logging of channel ids to be able to debug issues related to network requests not delivering easier. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph giraph-1213 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/96.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #96 commit 581dd9bbf47d02ceddf0aba2e8c97e80d7d6f44c Author: Maja Kabiljo Date: 2018-11-29T19:35:53Z GIRAPH-1213: Fix issues with network requests retries and add more logging Fixing two bugs: - When channel fails, we are currently retrying all requests towards the destination machine from the channel, instead of just ones which are happening on the concrete channel. - In practice, we've noticed BlockingOperationException can get thrown when we wait to connect on channel in which case we silently don't send the request we are trying to send, so catching this exception and retrying instead. Also added logging of channel ids to be able to debug issues related to network requests not delivering easier. ---
[GitHub] giraph pull request #94: Fix DefaultJobProgressTracker when splitMasterWorke...
GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/94 Fix DefaultJobProgressTracker when splitMasterWorker=false Summary: DefaultJobProgressTracker assumes we are using numWorkers+1 mappers, fix that Test Plan: Ran a job with splitMasterWorker=false, verified job progress gets printed correctly You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph giraph-1212 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/94.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #94 commit 0a959f4c179dd385f28d2a3a28ae0287caf5b35a Author: Maja Kabiljo Date: 2018-11-26T18:15:15Z Fix DefaultJobProgressTracker when splitMasterWorker=false Summary: DefaultJobProgressTracker assumes we are using numWorkers+1 mappers, fix that Test Plan: Ran a job with splitMasterWorker=false, verified job progress gets printed correctly ---
[GitHub] giraph pull request #93: GIRAPH-1211: Make retrying to send network requests...
GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/93 GIRAPH-1211: Make retrying to send network requests after timeout optional Summary: Using counters added in GIRAPH-1205 we were able to confirm that resending network requests after timeout almost never succeeds, so add an option to fail early instead of keep trying to resend these network requests indefinitely. Test Plan: Made response not being sent for a specific request id, verified that with keeping default value for this new option we kept trying to resend the request, and when making it false the job failed when request timeout was noticed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph giraph-1211 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/93.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #93 commit 1ce1a8e50565cf7adb4d5ebce872620949f3f906 Author: Maja Kabiljo Date: 2018-11-07T04:50:28Z GIRAPH-1211: Make retrying to send network requests after timeout optional Summary: Using counters added in GIRAPH-1205 we were able to confirm that resending network requests after timeout almost never succeeds, so add an option to fail early instead of keep trying to resend these network requests indefinitely. Test Plan: Made response not being sent for a specific request id, verified that with keeping default value for this new option we kept trying to resend the request, and when making it false the job failed when request timeout was noticed. ---
[GitHub] giraph pull request #88: GIRAPH-1205: Separate Giraph counters for different...
Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/88#discussion_r226012078 --- Diff: giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java --- @@ -141,6 +141,9 @@ /** How many network requests were resent because channel failed */ public static final String NETWORK_REQUESTS_RESENT_FOR_CHANNEL_FAILURE_NAME = "Network requests resent for channel failure"; + /** How many network requests were resent because connection failed */ + public static final String NETWORK_REQUESTS_RESENT_FOR_CONNECTION_FAILURE = --- End diff -- Nit: NETWORK_REQUESTS_RESENT_FOR_CONNECTION_FAILURE_NAME ---
[GitHub] giraph pull request #86: Fix tests
GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/86 Fix tests mvn verify passes You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph fix-tests Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/86.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #86 commit 9a48f3ef93c0cd320063332b7f07814bb2e0622d Author: Maja Kabiljo Date: 2018-09-21T20:23:08Z Fix tests ---
[GitHub] giraph pull request #84: Fix BspServiceMaster bug
Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/84#discussion_r218551052 --- Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java --- @@ -1379,9 +1379,15 @@ private boolean barrierOnWorkerList(String finishedWorkerPath, // Wait for a signal or timeout boolean eventTriggered = event.waitMsecs(eventLoopTimeout); + + // If the event was triggered, we reset it. In the next loop run, we will + // read ZK to get the new hosts. + if (eventTriggered) { +event.reset(); + } + long elapsedTimeSinceRegularRunMsec = System.currentTimeMillis() - lastRegularRunTimeMsec; - event.reset(); --- End diff -- Making sure I understand what's happening, does reodering of commands here makes no difference and the fix is about only clearing event if it triggered? ---