[GitHub] giraph issue #96: GIRAPH-1213: Fix issues with network requests retries and ...

2018-12-11 Thread majakabiljo
Github user majakabiljo commented on the issue:

https://github.com/apache/giraph/pull/96
  
I used a pipeline which runs 100 jobs and was always getting at least a few 
jobs stuck with open network requests. Running it with more logging helped 
identify these two issues, and after the change it was 100% successful.


---


[GitHub] giraph pull request #96: GIRAPH-1213: Fix issues with network requests retri...

2018-12-11 Thread majakabiljo
Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/96#discussion_r240745235
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java ---
@@ -1147,8 +1158,11 @@ private void checkRequestsAfterChannelFailure(final 
Channel channel) {
 resendRequestsWhenNeeded(new Predicate() {
   @Override
   public boolean apply(RequestInfo requestInfo) {
-return requestInfo.getDestinationAddress().equals(
-channel.remoteAddress());
+if (requestInfo.getWriteFuture() == null ||
--- End diff --

It can happen if the request wasn't sent out yet, not sure if there is some 
other scenario.


---


[GitHub] giraph pull request #97: GIRAPH-1215: Make FixedCapacityHeaps work with 0 ca...

2018-12-05 Thread majakabiljo
GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/97

GIRAPH-1215: Make FixedCapacityHeaps work with 0 capacity

Currently FixedCapacityHeaps throw an exception when they are used with 
capacity 0, this fixes it.

Most of the diff is autogenerated, only change in 
FixedCapacityType2TypeMinHeap.java

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph giraph-1215

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #97


commit 19b7b035dab9a04c79ebb46149ab07366751895a
Author: Maja Kabiljo 
Date:   2018-12-05T17:54:55Z

GIRAPH-1215: Make FixedCapacityHeaps work with 0 capacity

Currently FixedCapacityHeaps throw an exception when they are used with 
capacity 0, this fixes it.




---


[GitHub] giraph pull request #96: GIRAPH-1213: Fix issues with network requests retri...

2018-11-29 Thread majakabiljo
GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/96

GIRAPH-1213: Fix issues with network requests retries and add more logging

Fixing two bugs:
- When channel fails, we are currently retrying all requests towards the 
destination machine from the channel, instead of just ones which are happening 
on the concrete channel.
- In practice, we've noticed BlockingOperationException can get thrown when 
we wait to connect on channel in which case we silently don't send the request 
we are trying to send, so catching this exception and retrying instead.
Also added logging of channel ids to be able to debug issues related to 
network requests not delivering easier.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph giraph-1213

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/96.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #96


commit 581dd9bbf47d02ceddf0aba2e8c97e80d7d6f44c
Author: Maja Kabiljo 
Date:   2018-11-29T19:35:53Z

GIRAPH-1213: Fix issues with network requests retries and add more logging

Fixing two bugs:
- When channel fails, we are currently retrying all requests towards the 
destination machine from the channel, instead of just ones which are happening 
on the concrete channel.
- In practice, we've noticed BlockingOperationException can get thrown when 
we wait to connect on channel in which case we silently don't send the request 
we are trying to send, so catching this exception and retrying instead.
Also added logging of channel ids to be able to debug issues related to 
network requests not delivering easier.




---


[GitHub] giraph pull request #94: Fix DefaultJobProgressTracker when splitMasterWorke...

2018-11-26 Thread majakabiljo
GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/94

Fix DefaultJobProgressTracker when splitMasterWorker=false

Summary: DefaultJobProgressTracker assumes we are using numWorkers+1 
mappers, fix that

Test Plan: Ran a job with splitMasterWorker=false, verified job progress 
gets printed correctly

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph giraph-1212

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/94.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #94


commit 0a959f4c179dd385f28d2a3a28ae0287caf5b35a
Author: Maja Kabiljo 
Date:   2018-11-26T18:15:15Z

Fix DefaultJobProgressTracker when splitMasterWorker=false

Summary: DefaultJobProgressTracker assumes we are using numWorkers+1 
mappers, fix that

Test Plan: Ran a job with splitMasterWorker=false, verified job progress 
gets printed correctly




---


[GitHub] giraph pull request #93: GIRAPH-1211: Make retrying to send network requests...

2018-11-06 Thread majakabiljo
GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/93

GIRAPH-1211: Make retrying to send network requests after timeout optional

Summary: Using counters added in GIRAPH-1205 we were able to confirm that 
resending network requests after timeout almost never succeeds, so add an 
option to fail early instead of keep trying to resend these network requests 
indefinitely.

Test Plan: Made response not being sent for a specific request id, verified 
that with keeping default value for this new option we kept trying to resend 
the request, and when making it false the job failed when request timeout was 
noticed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph giraph-1211

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #93


commit 1ce1a8e50565cf7adb4d5ebce872620949f3f906
Author: Maja Kabiljo 
Date:   2018-11-07T04:50:28Z

GIRAPH-1211: Make retrying to send network requests after timeout optional

Summary: Using counters added in GIRAPH-1205 we were able to confirm that 
resending network requests after timeout almost never succeeds, so add an 
option to fail early instead of keep trying to resend these network requests 
indefinitely.

Test Plan: Made response not being sent for a specific request id, verified 
that with keeping default value for this new option we kept trying to resend 
the request, and when making it false the job failed when request timeout was 
noticed.




---


[GitHub] giraph pull request #88: GIRAPH-1205: Separate Giraph counters for different...

2018-10-17 Thread majakabiljo
Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/88#discussion_r226012078
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java ---
@@ -141,6 +141,9 @@
   /** How many network requests were resent because channel failed */
   public static final String 
NETWORK_REQUESTS_RESENT_FOR_CHANNEL_FAILURE_NAME =
   "Network requests resent for channel failure";
+  /** How many network requests were resent because connection failed */
+  public static final String 
NETWORK_REQUESTS_RESENT_FOR_CONNECTION_FAILURE =
--- End diff --

Nit: NETWORK_REQUESTS_RESENT_FOR_CONNECTION_FAILURE_NAME


---


[GitHub] giraph pull request #86: Fix tests

2018-09-26 Thread majakabiljo
GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/86

Fix tests

mvn verify passes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph fix-tests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #86


commit 9a48f3ef93c0cd320063332b7f07814bb2e0622d
Author: Maja Kabiljo 
Date:   2018-09-21T20:23:08Z

Fix tests




---


[GitHub] giraph pull request #84: Fix BspServiceMaster bug

2018-09-26 Thread majakabiljo
Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/84#discussion_r218551052
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java ---
@@ -1379,9 +1379,15 @@ private boolean barrierOnWorkerList(String 
finishedWorkerPath,
 
   // Wait for a signal or timeout
   boolean eventTriggered = event.waitMsecs(eventLoopTimeout);
+
+  // If the event was triggered, we reset it. In the next loop run, we 
will
+  // read ZK to get the new hosts.
+  if (eventTriggered) {
+event.reset();
+  }
+
   long elapsedTimeSinceRegularRunMsec = System.currentTimeMillis() -
   lastRegularRunTimeMsec;
-  event.reset();
--- End diff --

Making sure I understand what's happening, does reodering of commands here 
makes no difference and the fix is about only clearing event if it triggered?


---