[ https://issues.apache.org/jira/browse/GIRAPH-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated GIRAPH-37:
------------------------------

    Attachment: GIRAPH-37-wip.patch

Here's a work-in-progress patch for review; I have to spend next week working 
on something else, so I wanted to get it out before it went stale.  It uses 
Finagle with Thrift.  The experience was at first challenging due to Finagle 
ramp-up costs, then pleasant, and is now challenging again due to stability 
issues.  95% of the patch by size is generated Thrift code; I'm not usually 
a fan of including generated code, but as explained below, this is a reasonable 
approach for Finagle.
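
For reviewers who haven't seen Finagle's Thrift integration, the client-side 
wiring looks roughly like this.  This is only a sketch following Finagle's 
documented usage of the time, with a hypothetical {{Hello}} service standing in 
for the actual generated Giraph one; the {{ServiceToClient}} adapter is what 
the forked compiler produces:
{noformat}
import java.net.InetSocketAddress
import org.apache.thrift.protocol.TBinaryProtocol
import com.twitter.finagle.builder.ClientBuilder
import com.twitter.finagle.thrift.ThriftClientFramedCodec

// Build a transport-level service; hostConnectionLimit and retries are
// examples of the tuning knobs discussed under "the bad" below.
val transport = ClientBuilder()
  .hosts(new InetSocketAddress("localhost", 8080))
  .codec(ThriftClientFramedCodec())
  .hostConnectionLimit(1)
  .retries(2)
  .build()

// Hello.ServiceToClient is generated by the forked thrift compiler and
// turns each RPC into a com.twitter.util.Future.
val client = new Hello.ServiceToClient(transport, new TBinaryProtocol.Factory())

client.hi() onSuccess { response =>
  println("Received: " + response)
}
{noformat}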

The good:
* With this patch I can scale up to about 1k workers, although not reliably 
(see the bad points below).
* This approach moves us away from Hadoop RPC, which helps the upcoming 
YARN work; Hadoop RPC itself is not ideal anyway.
* Judging by what Hyunsik had to go through when he looked at Netty+PB, 
Finagle definitely saves quite a lot of work (see the server-side sketch after 
this list).
* This exercise has identified several improvements that need to be made to 
the overall codebase.  I've opened GIRAPH-57, GIRAPH-55 and GIRAPH-54 for these.
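
To make the Netty+PB comparison concrete: the entire server-side bootstrap in 
Finagle is a handful of lines.  Again a sketch with the hypothetical {{Hello}} 
service; the Future-based {{ServiceIface}} and the {{Service}} wrapper come 
from the forked compiler:
{noformat}
import java.net.InetSocketAddress
import org.apache.thrift.protocol.TBinaryProtocol
import com.twitter.finagle.builder.ServerBuilder
import com.twitter.finagle.thrift.ThriftServerFramedCodec
import com.twitter.util.Future

// Implement the generated Future-based interface...
val processor = new Hello.ServiceIface {
  def hi() = Future.value("hi")
}

// ...and bind it; framing, threading and connection management come from
// Finagle/Netty rather than hand-written pipeline code.
ServerBuilder()
  .codec(ThriftServerFramedCodec())
  .bindTo(new InetSocketAddress(8080))
  .name("HelloService")
  .build(new Hello.Service(processor, new TBinaryProtocol.Factory()))
{noformat}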

The bad:
* The Thrift-Finagle combination uses a forked version of the Thrift compiler 
to generate the interface Finagle expects.  Once up and running this is fine, 
but it means we'd be dependent on this oddity.  It also means we'd need to 
include the generated code, since it's too much to ask regular developers (not 
interested in the RPC layer) to download a forked Thrift compiler from GitHub, 
compile it, keep it around, etc.
* There are quite a lot of knobs that have to be tuned to get a reliable run 
with a large number of mappers ({{hostConnectionLimit}} and {{retries}} in the 
client sketch above are examples).  This is partly a fact of life with 
distributed RPC, and we can probably determine some of them programmatically, 
but at the moment I can only get successful runs about two-thirds of the time.  
For the rest I get very difficult-to-decipher stack traces such as:
{noformat}
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x4b7f1841, /172.18.67.79:46082 :> esv4-hcl227.corp.linkedin.com/172.18.66.182:30047] EXCEPTION: com.twitter.util.Promise$ImmutableResult: Result set multiple times: Throw(java.lang.RuntimeException: Hit exception in proxied call))
java.lang.RuntimeException: Hit exception in proxied call
        at org.apache.giraph.comm.finaglerpc.ThriftRPCProxyClient$CDLListener.onFailure(ThriftRPCProxyClient.java:91)
        at com.twitter.util.Future$$anonfun$addEventListener$1.apply(Future.scala:277)
        at com.twitter.util.Future$$anonfun$addEventListener$1.apply(Future.scala:276)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
        at com.twitter.concurrent.IVar.set(IVar.scala:50)
        at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
        at com.twitter.util.Promise.update(Future.scala:450)
        at com.twitter.util.Promise$$anon$2$$anonfun$8.apply(Future.scala:506)
        at com.twitter.util.Promise$$anon$2$$anonfun$8.apply(Future.scala:497)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
        at com.twitter.concurrent.IVar.set(IVar.scala:50)
        at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
        at com.twitter.util.Promise.update(Future.scala:450)
        at com.twitter.finagle.service.RetryingFilter$$anonfun$1.apply(RetryingFilter.scala:73)
        at com.twitter.finagle.service.RetryingFilter$$anonfun$1.apply(RetryingFilter.scala:56)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
        at com.twitter.concurrent.IVar.set(IVar.scala:50)
        at com.twitter.concurrent.IVar.set(IVar.scala:55)
        at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
        at com.twitter.util.Promise.update(Future.scala:450)
        at com.twitter.util.Promise$$anon$2$$anonfun$8$$anonfun$apply$7.apply(Future.scala:502)
        at com.twitter.util.Promise$$anon$2$$anonfun$8$$anonfun$apply$7.apply(Future.scala:502)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
        at com.twitter.concurrent.IVar.set(IVar.scala:50)
        at com.twitter.concurrent.IVar.set(IVar.scala:55)
        at com.twitter.concurrent.IVar.set(IVar.scala:55)
        at com.twitter.concurrent.IVar.set(IVar.scala:55)
        at com.twitter.concurrent.IVar.set(IVar.scala:55)
        at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
        at com.twitter.util.Promise.update(Future.scala:450)
        at com.twitter.util.Promise$$anon$1$$anonfun$7.apply(Future.scala:491)
        at com.twitter.util.Promise$$anon$1$$anonfun$7.apply(Future.scala:490)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
        at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
        at com.twitter.concurrent.IVar.set(IVar.scala:50)
        at com.twitter.concurrent.IVar.set(IVar.scala:55)
        at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
        at com.twitter.util.Promise.update(Future.scala:450)
        at com.twitter.finagle.channel.ChannelService.com$twitter$finagle$channel$ChannelService$$reply(ChannelService.scala:51)
        at com.twitter.finagle.channel.ChannelService$$anon$1.exceptionCaught(ChannelService.scala:74)
        at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:66)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:238)
        at com.twitter.finagle.thrift.ThriftFrameCodec.handleUpstream(ThriftFrameCodec.scala:11)
        at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:432)
        at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:52)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
        at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:76)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:216)
        at com.twitter.finagle.thrift.ThriftFrameCodec.handleUpstream(ThriftFrameCodec.scala:11)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
{noformat}
Another one that happens quite a lot is {{Caused by: 
com.twitter.finagle.UnknownChannelException: 
com.twitter.util.Promise$ImmutableResult: Result set multiple times: 
Throw(java.lang.RuntimeException: Hit exception in proxied call)}}.  I think I 
need help from someone more experienced with Finagle, but I'm a bit nervous 
about the underlying framework being difficult to debug and configure.
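
For anyone puzzling over that error: {{Promise}} in com.twitter.util is 
single-assignment, so "Result set multiple times" means something completed 
the same promise twice, e.g. a retry path and an exception handler racing 
(the traces above show both {{RetryingFilter}} and {{exceptionCaught}} in 
play).  A minimal sketch of the failure mode:
{noformat}
import com.twitter.util.{Promise, Return, Throw}

val p = new Promise[String]
p.update(Return("first"))                // satisfies the promise
p.update(Throw(new RuntimeException))    // throws Promise$ImmutableResult:
                                         // "Result set multiple times"

// updateIfEmpty is the non-throwing variant; it returns false if the
// promise was already satisfied.
val q = new Promise[String]
q.update(Return("first"))
assert(!q.updateIfEmpty(Return("second")))
{noformat}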

Currently the patch passes all unit tests (though it needs more for the Finagle 
section itself).  Overall, I think the patch is worth pursuing; it could be 
committed with Hadoop RPC as the default RPC and the config/stability issues 
resolved in follow-up patches.  Perhaps it's just an issue of lousy 
configuration on my part.  Another option would be to look in a different 
direction, such as MessagePack.
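
If we do commit with Hadoop RPC as the default, I'd picture the selection as a 
simple opt-in configuration switch, something like this sketch (the flag name, 
factory and class names here are hypothetical, not part of the patch):
{noformat}
import org.apache.hadoop.conf.Configuration

// Stand-ins for the two communication implementations.
trait WorkerCommunications
class HadoopRpcCommunications(conf: Configuration) extends WorkerCommunications
class FinagleRpcCommunications(conf: Configuration) extends WorkerCommunications

// Hadoop RPC stays the default; Finagle is opt-in until the stability
// issues are resolved.
def createCommunications(conf: Configuration): WorkerCommunications =
  if (conf.getBoolean("giraph.useFinagleRpc", false))  // hypothetical flag
    new FinagleRpcCommunications(conf)
  else
    new HadoopRpcCommunications(conf)
{noformat}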

Thoughts?
                
> Implement Netty-backed rpc solution
> -----------------------------------
>
>                 Key: GIRAPH-37
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-37
>             Project: Giraph
>          Issue Type: New Feature
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: GIRAPH-37-wip.patch
>
>
> GIRAPH-12 considered replacing the current Hadoop-based RPC method with 
> Netty, but instead went in another direction. I think there is still value 
> in this approach, and I will also look at Finagle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira