[
https://issues.apache.org/jira/browse/HBASE-27048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566841#comment-17566841
]
Bryan Beaudreault edited comment on HBASE-27048 at 7/14/22 1:27 PM:
--------------------------------------------------------------------
The problem is that these tests make use of ManualEnvironmentEdge, which
defaults to a currentTime of 1. In branch-2 and 2.5, NettyServerRpcConnection
properly [passes the EnvironmentEdge's time into the
ServerCall|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L122],
but in branch-2.4 it [still passes in
System.currentTimeMillis()|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L121].
In RSRpcServices getRemainingRpcTimeout, we subtract now (which is 1 in this
test) from call.getReceiveTime() (which is current time millis), which ends up
being a large negative number. We then subtract that value from the timeout,
which results in a very large positive value timeout. I didn't consider the
case where now would be much smaller than receive time, since time should
always be marching forward.
I think there are two potential fixes to make:
# branch-2.4 should be fixed to pass EnvironmentEdge in
NettyServerRpcConnection, which will fix this test
# It might be good to add a guardrail in getRemainingRpcTimeout to ensure that
we only subtract from the timeout if now > receiveTime. But that also seems
like an edge case that should really only happen in tests, so not sure if
that's worth pushing to all 3 branches.
We definitely need to do 1, but thoughts on 2? I can submit a Jira to fix just
the test or add the guardrail too. The latter is just more work submitting to
all the branches, etc.
was (Author: bbeaudreault):
The problem is that these tests make use of ManualEnvironmentEdge, which
defaults to a currentTime of 1. In branch-2 and 2.5, NettyServerRpcConnection
properly [passes the EnvironmentEdge's time into the
ServerCall|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L122],
but in branch-2.4 it [still passes in
System.currentTimeMillis()|https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyServerRpcConnection.java#L121].
In RSRpcServices getRemainingRpcTimeout, we subtract now (which is 1 in this
test) from call.getReceiveTime() (which is current time millis), which ends up
being a large negative number. We then subtract that value from the timeout,
which results in a very large positive value timeout. I didn't consider the
case where now would be much smaller than receive time, since time should
always be marching forward.
I think there are two potential fixes to make:
# branch-2.4 should be fixed to pass EnvironmentEdge in
NettyServerRpcConnection, which will fix this test
# It might be good to add a guardrail in getRemainingRpcTimeout to ensure that
we only subtract from the timeout if now > receiveTime. But that also seems
like an edge case that should really only happen in tests, so not sure if
that's worth pushing to all 3 branches.
We definitely need to do 1, but thoughts on 2? I can submit a Jira to fix just
the test or add the guardrail too
> Server side scanner time limit should account for time in queue
> ---------------------------------------------------------------
>
> Key: HBASE-27048
> URL: https://issues.apache.org/jira/browse/HBASE-27048
> Project: HBase
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> When a scan request comes in with a timeout specified and heartbeats/partials
> allowed, we calculate a time limit for running the scan to be half of that
> timeout. The idea is to return before the timeout expires.
> The calculation of that time limit is "now + timeout / 2", where now is the
> point at which the scan is starting to run. What's missed here is the scan
> may have spent upwards of a few seconds in the IPC queue before being
> serviced. In this case, the time limit may extend beyond the timeout of the
> request and the server will not return in time.
> We should calculate the time limit from ServerCall.getReceiveTime instead to
> avoid these timeouts.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)