[ 
https://issues.apache.org/jira/browse/HBASE-27048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566841#comment-17566841
 ] 

Bryan Beaudreault edited comment on HBASE-27048 at 7/14/22 1:27 PM:
--------------------------------------------------------------------

The problem is that these tests make use of ManualEnvironmentEdge, which 
defaults to a currentTime of 1. In branch-2 and 2.5, NettyServerRpcConnection 
properly [passes the EnvironmentEdge's time into the 
ServerCall|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L122],
 but in branch-2.4 it [still passes in 
System.currentTimeMillis()|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L121].
 

In RSRpcServices getRemainingRpcTimeout, we subtract now (which is 1 in this 
test) from call.getReceiveTime() (which is current time millis), which ends up 
being a large negative number. We then subtract that value from the timeout, 
which results in a very large positive value timeout. I didn't consider the 
case where now would be much smaller than receive time, since time should 
always be marching forward.

I think there are two potential fixes to make:
 # branch-2.4 should be fixed to pass EnvironmentEdge in 
NettyServerRpcConnection, which will fix this test
 # It might be good to add a guardrail in getRemainingRpcTimeout to ensure that 
we only subtract from the timeout if now > receiveTime. But that also seems 
like an edge case that should really only happen in tests, so not sure if 
that's worth pushing to all 3 branches.

We definitely need to do 1, but thoughts on 2? I can submit a Jira to fix just 
the test or add the guardrail too. The latter is just more work submitting to 
all the branches, etc.


was (Author: bbeaudreault):
The problem is that these tests make use of ManualEnvironmentEdge, which 
defaults to a currentTime of 1. In branch-2 and 2.5, NettyServerRpcConnection 
properly [passes the EnvironmentEdge's time into the 
ServerCall|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L122],
 but in branch-2.4 it [still passes in 
System.currentTimeMillis()|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L121].
 

In RSRpcServices getRemainingRpcTimeout, we subtract now (which is 1 in this 
test) from call.getReceiveTime() (which is current time millis), which ends up 
being a large negative number. We then subtract that value from the timeout, 
which results in a very large positive value timeout. I didn't consider the 
case where now would be much smaller than receive time, since time should 
always be marching forward.

I think there are two potential fixes to make:
 # branch-2.4 should be fixed to pass EnvironmentEdge in 
NettyServerRpcConnection, which will fix this test
 # It might be good to add a guardrail in getRemainingRpcTimeout to ensure that 
we only subtract from the timeout if now > receiveTime. But that also seems 
like an edge case that should really only happen in tests, so not sure if 
that's worth pushing to all 3 branches.

We definitely need to do 1, but thoughts on 2? I can submit a Jira to fix just 
the test or add the guardrail too

> Server side scanner time limit should account for time in queue
> ---------------------------------------------------------------
>
>                 Key: HBASE-27048
>                 URL: https://issues.apache.org/jira/browse/HBASE-27048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> When a scan request comes in with a timeout specified and heartbeats/partials 
> allowed, we calculate a time limit for running the scan to be half of that 
> timeout. The idea is to return before the timeout expires.
> The calculation of that time limit is "now + timeout / 2", where now is the 
> point at which the scan is starting to run. What's missed here is the scan 
> may have spent upwards of a few seconds in the IPC queue before being 
> serviced. In this case, the time limit may extend beyond the timeout of the 
> request and the server will not return in time.
> We should calculate the time limit from ServerCall.getReceiveTime instead to 
> avoid these timeouts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to