[
https://issues.apache.org/jira/browse/HBASE-21775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755347#comment-16755347
]
Tommy Li edited comment on HBASE-21775 at 1/29/19 8:13 PM:
-----------------------------------------------------------
Thanks for the link, [~stack]. I took a look at the report from before my
change went in and indeed TestAsyncProcess [is not listed
there|https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.1/168/artifact/dashboard.html].
Could this be a build caching issue?
> The BufferedMutator doesn't ever refresh region location cache
> --------------------------------------------------------------
>
> Key: HBASE-21775
> URL: https://issues.apache.org/jira/browse/HBASE-21775
> Project: HBase
> Issue Type: Bug
> Components: Client
> Reporter: Tommy Li
> Assignee: Tommy Li
> Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.1.3, 2.0.5, 1.3.4
>
> Attachments: HBASE-21775.master.001.patch,
> org.apache.hadoop.hbase.client.TestAsyncProcess-with-HBASE-21775.txt,
> org.apache.hadoop.hbase.client.TestAsyncProcess-without-HBASE-21775.txt
>
>
> I noticed in some of my writing jobs that the BufferedMutator
> would get stuck retrying writes against a dead server.
> {code:java}
> 19/01/18 15:15:47 INFO [Executor task launch worker for task 0]
> client.AsyncRequestFutureImpl: #2, waiting for 1 actions to finish on table:
> dummy_table
> 19/01/18 15:15:54 WARN [htable-pool3-t56] client.AsyncRequestFutureImpl:
> id=2, table=dummy_table, attempt=15/21, failureCount=1ops, last
> exception=org.apache.hadoop.hbase.DoNotRetryIOException: Operation rpcTimeout
> on <SERVER>,17020,1547848193782, tracking started Fri Jan 18 14:55:37 PST
> 2019; NOT retrying, failed=1 -- final attempt!
> 19/01/18 15:15:54 ERROR [Executor task launch worker for task 0]
> IngestRawData.map(): [B@258bc2c7:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1
> action: Operation rpcTimeout: 1 time, servers with issues:
> <SERVER>,17020,1547848193782
> {code}
>
> After the single remaining action permanently failed, it would resume
> progress only to get stuck again retrying against the same dead server:
> {code:java}
> 19/01/18 15:21:18 INFO [Executor task launch worker for task 0]
> client.AsyncRequestFutureImpl: #2, waiting for 1 actions to finish on table:
> dummy_table
> 19/01/18 15:21:18 INFO [Executor task launch worker for task 0]
> client.AsyncRequestFutureImpl: #2, waiting for 1 actions to finish on table:
> dummy_table
> 19/01/18 15:21:20 INFO [htable-pool3-t55] client.AsyncRequestFutureImpl:
> id=2, table=dummy_table, attempt=6/21, failureCount=1ops, last
> exception=java.net.ConnectException: Call to <SERVER> failed on connection
> exception:
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException:
> connection timed out: <SERVER> on <SERVER>,17020,1547848193782, tracking
> started null, retrying after=20089ms, operationsToReplay=1
> {code}
>
> Only restarting the client process to generate a new BufferedMutator instance
> would fix the issue, at least until the next regionserver crash.
> The logs I've pasted show the issue happening with a
> ConnectTimeoutException, but we've also seen it with
> NotServingRegionException and some others.
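
The symptom described above (every retry goes to the same dead server until the process restarts) is what one would expect from a client-side location cache whose entry is never invalidated after a failed call. The following is a minimal, self-contained sketch of that failure mode. It is a toy model only: `StaleLocationDemo`, `META`, `CACHE`, and all method names are hypothetical, and it is neither the HBase client code nor the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of a client-side region location cache. All names here are
 * hypothetical; this is not the HBase client code.
 */
class StaleLocationDemo {

    /** Stand-in for the cluster's authoritative region-to-server mapping. */
    static final Map<String, String> META = new HashMap<>();

    /** Client-side cache of region locations. */
    static final Map<String, String> CACHE = new HashMap<>();

    /** Looks up the cache first and only consults META on a miss. */
    static String locate(String region) {
        return CACHE.computeIfAbsent(region, META::get);
    }

    /** A write succeeds only if it is sent to the server currently hosting the region. */
    static boolean send(String region, String server) {
        return server.equals(META.get(region));
    }

    /**
     * Retries a write up to maxAttempts times. Returns the 1-based attempt that
     * succeeded, or -1 if retries were exhausted.
     */
    static int write(String region, int maxAttempts, boolean refreshOnFailure) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (send(region, locate(region))) {
                return attempt;
            }
            if (refreshOnFailure) {
                // Drop the stale entry so the next attempt re-reads META.
                CACHE.remove(region);
            }
        }
        return -1;
    }

    /** Caches a region on server-a, then moves the region to server-b and writes. */
    static int simulate(boolean refreshOnFailure) {
        META.clear();
        CACHE.clear();
        META.put("region-1", "server-a");
        locate("region-1");               // warm the cache
        META.put("region-1", "server-b"); // server-a dies, region is reassigned
        return write("region-1", 21, refreshOnFailure);
    }

    public static void main(String[] args) {
        System.out.println("without refresh: " + simulate(false)); // prints -1
        System.out.println("with refresh:    " + simulate(true));  // prints 2
    }
}
```

In this model, the run without invalidation exhausts all 21 attempts against the stale location, while the run that clears the cache entry on failure succeeds on the second attempt.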