[jira] [Commented] (TEPHRA-257) If start() encounters an RPC timeout, an invalid transaction is left behind

Andreas Neumann (JIRA) Thu, 12 Oct 2017 13:12:28 -0700

    [ 
https://issues.apache.org/jira/browse/TEPHRA-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202550#comment-16202550
 ]


Andreas Neumann commented on TEPHRA-257:
----------------------------------------

It turns out that this cannot be fixed as long as Tephra uses Thrift. In 
theory, we could modify this part of Thrift's ProcessFunction class: 
{code}
    if(!isOneway()) {
      oprot.writeMessageBegin(new TMessage(getMethodName(), TMessageType.REPLY, seqid));
      result.write(oprot);
      oprot.writeMessageEnd();
      oprot.getTransport().flush();
    }
{code}
by wrapping it in a try block and catching any socket exceptions. However, 
flush() does not actually flush to the socket: due to Thrift's async nature, 
it only flushes to a write request queue, and the socket exception is raised 
later, in the worker thread that performs the write. By that time we have 
lost the context of the request and have no callback through which to abort 
the transaction. 
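
To illustrate why a try block around flush() would not help, here is a 
minimal sketch in plain Java. It is a toy model, not Thrift's actual code: 
QueueingTransport, its writeQueue, and the "Broken pipe" failure are all 
hypothetical stand-ins for the behavior described above. The handler's 
flush() only enqueues the reply, so its catch clause never fires; the failure 
surfaces in a separate writer thread that knows nothing about the transaction. 

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

public class DeferredFlushDemo {
    /** Hypothetical transport whose flush() only enqueues the frame. */
    static class QueueingTransport {
        final BlockingQueue<byte[]> writeQueue = new LinkedBlockingQueue<>();

        void flush(byte[] frame) {
            writeQueue.add(frame); // never touches the socket
        }
    }

    /** Returns { handlerSawFailure, writerSawFailure }. */
    static boolean[] run() throws InterruptedException {
        QueueingTransport transport = new QueueingTransport();

        // Request handler: this is where we would like to abort the transaction.
        boolean handlerSawFailure = false;
        try {
            transport.flush("reply for tx 42".getBytes());
        } catch (RuntimeException e) {
            handlerSawFailure = true; // not reached: enqueueing succeeds
        }

        // Writer thread: performs the real write later, without request context.
        AtomicReference<IOException> writerFailure = new AtomicReference<>();
        Thread writer = new Thread(() -> {
            try {
                byte[] frame = transport.writeQueue.take();
                // Simulate a client that has already timed out and closed the socket.
                throw new IOException("Broken pipe writing " + frame.length + " bytes");
            } catch (IOException e) {
                writerFailure.set(e); // only the writer learns about the failure
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        writer.join();

        return new boolean[] { handlerSawFailure, writerFailure.get() != null };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("handler saw failure: " + r[0]);
        System.out.println("writer saw failure:  " + r[1]);
    }
}
```

In this model the handler reports no failure while the writer does, which is 
the gap that prevents the server from aborting the transaction. 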

Thus marking this as won't fix. 

> If start() encounters an RPC timeout, an invalid transaction is left behind
> ---------------------------------------------------------------------------
>
>                 Key: TEPHRA-257
>                 URL: https://issues.apache.org/jira/browse/TEPHRA-257
>             Project: Tephra
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.13.0-incubating
>            Reporter: Andreas Neumann
>            Assignee: Poorna Chandra
>
> Suppose the following scenario: 
> - a thrift client starts a transaction
> - the server responds, but for whatever reason it is slow 
> - by the time the response is sent, the client has timed out the connection
> - now the server has started a transaction, but the client has no knowledge 
> of it
> - that transaction will never be committed or aborted and eventually times out
> - it becomes an invalid transaction
> This is a common scenario when HDFS is slow and the write load is high, 
> because a lot of change ids have to be written to a slow transaction log. 
> The server then generates invalid transactions systematically, which 
> eventually degrades the performance of the entire system.
> It would be good if the server could detect this situation and abort the 
> transaction immediately. This is safe to do whenever sending of the response 
> fails, because we know that the client did not receive it, and hence it will 
> not generate data with that transaction id. 
> This is a tricky change, though: Thrift does not give us a way to intercept 
> exceptions from socket failures. We would have to copy a Thrift class 
> (ProcessFunction) and change it to handle exceptions that occur during the 
> write of the response. 
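
For reference, the behavior requested in the description above could look 
roughly like the following sketch, assuming a transport where the reply is 
written synchronously so that the failure is visible to the code that started 
the transaction. TxManager and ReplyChannel are hypothetical names invented 
for this example; they are not Tephra or Thrift classes. 

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class AbortOnFailedReplyDemo {
    /** Hypothetical minimal transaction manager (not Tephra's). */
    static class TxManager {
        private long nextId = 1;
        final Set<Long> inProgress = new HashSet<>();

        long start() {
            long id = nextId++;
            inProgress.add(id);
            return id;
        }

        void abort(long id) {
            inProgress.remove(id);
        }
    }

    /** Hypothetical synchronous reply channel to the client. */
    interface ReplyChannel {
        void send(long txId) throws IOException;
    }

    /** Starts a transaction; aborts it at once if the reply cannot be sent. */
    static Long startAndReply(TxManager tm, ReplyChannel channel) {
        long id = tm.start();
        try {
            channel.send(id);
            return id;
        } catch (IOException e) {
            // The client never received the id, so it cannot have written
            // data with it; aborting immediately is safe.
            tm.abort(id);
            return null;
        }
    }

    public static void main(String[] args) {
        TxManager tm = new TxManager();
        Long failed = startAndReply(tm, id -> { throw new IOException("client timed out"); });
        Long ok = startAndReply(tm, id -> { /* reply delivered */ });
        System.out.println("failed reply -> " + failed + ", delivered reply -> " + ok);
        System.out.println("in progress: " + tm.inProgress);
    }
}
```

The sketch only works because send() fails in the same call stack that holds 
the transaction id. 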



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
