[ 
https://issues.apache.org/jira/browse/HBASE-13011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315774#comment-14315774
 ] 

zhangduo commented on HBASE-13011:
----------------------------------

Almost there. I think I found a data race.

In AsyncRpcChannel.startHBaseConnection, we call writeRequest when the connect 
operation completes, to write out all pendingCalls.
But at the same time, AsyncRpcChannel.callMethod puts the call into pendingCalls 
and then calls writeRequest.

So there could be a situation where one call is written out twice.

Any suggestions on how to fix it? [~stack], [~jurmous]

IMHO, the locking scheme is not clear in AsyncRpcClient and AsyncRpcChannel. 
Maybe we need to revisit it and use some standard locking method? Thanks~
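
To make the race above concrete, here is a minimal sketch of one possible 
locking approach. It is not the actual AsyncRpcChannel code: the class 
PendingCallsSketch, the connected flag, onConnected, and writtenCalls are all 
hypothetical names invented for illustration, and calls are reduced to integer 
ids. The idea is that a single lock guards both the "queue or write" decision 
in callMethod and the drain done on connect, so a call is handled by exactly 
one of the two paths.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a channel; NOT the real AsyncRpcChannel. */
public class PendingCallsSketch {
    private final Object lock = new Object();
    // Call ids queued while the connection is not yet established.
    private final List<Integer> pendingCalls = new ArrayList<>();
    // Records every write so that duplicates would be visible.
    private final List<Integer> written = new ArrayList<>();
    private boolean connected = false;

    /** Caller side: either queue the call or write it, never both. */
    public void callMethod(int callId) {
        synchronized (lock) {
            if (!connected) {
                // The connect callback will flush this call later.
                pendingCalls.add(callId);
                return;
            }
        }
        // Already connected: this thread is the only one writing this call.
        writeRequest(callId);
    }

    /** Connect-completion side: flip the flag and drain the queue atomically. */
    public void onConnected() {
        List<Integer> toFlush;
        synchronized (lock) {
            connected = true;
            toFlush = new ArrayList<>(pendingCalls);
            pendingCalls.clear();
        }
        // Write outside the lock so callers are not blocked on I/O.
        for (int callId : toFlush) {
            writeRequest(callId);
        }
    }

    /** Placeholder for the real network write; here it only records the id. */
    private void writeRequest(int callId) {
        synchronized (written) {
            written.add(callId);
        }
    }

    /** Snapshot of everything written so far. */
    public List<Integer> writtenCalls() {
        synchronized (written) {
            return new ArrayList<>(written);
        }
    }
}
```

Note that this sketch only prevents duplicate writes. A call issued right 
after the connect completes may be written before the queued calls are 
flushed, so it assumes the protocol does not depend on write order.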

> TestLoadIncrementalHFiles is flakey when using AsyncRpcClient as client 
> implementation
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-13011
>                 URL: https://issues.apache.org/jira/browse/HBASE-13011
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: zhangduo
>
> The test sometimes fails because of a timeout.
> https://builds.apache.org/job/PreCommit-HBASE-Build/12769/testReport/junit/org.apache.hadoop.hbase.mapreduce/TestLoadIncrementalHFiles/testSimpleLoad/
> Digging into it, I found this:
> {noformat}
> 2015-02-11 02:01:47,304 INFO  [LoadIncrementalHFiles-1] 
> mapreduce.LoadIncrementalHFiles(563): Trying to load 
> hfile=hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_1
>  first=ddd last=ooo
> 2015-02-11 02:01:47,308 INFO  [LoadIncrementalHFiles-0] 
> mapreduce.LoadIncrementalHFiles(563): Trying to load 
> hfile=hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_0
>  first=aaaa last=cccc
> 2015-02-11 02:01:47,317 DEBUG [LoadIncrementalHFiles-2] 
> mapreduce.LoadIncrementalHFiles$3(664): Going to connect to server 
> region=bulkNS:mytable_testSimpleLoad,,1423620104753.fdcbd21e43683c753bae40f1d890daa6.,
>  hostname=asf910.gq1.ygridcore.net,41003,1423620099272, seqNum=2 for row  
> with hfile group 
> [{[B@7173d25a,hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_0}]
> 2015-02-11 02:01:47,320 DEBUG [LoadIncrementalHFiles-3] 
> mapreduce.LoadIncrementalHFiles$3(664): Going to connect to server 
> region=bulkNS:mytable_testSimpleLoad,ddd,1423620104753.ec757ff718ce8ab99f4f6bcca389d67f.,
>  hostname=asf910.gq1.ygridcore.net,41003,1423620099272, seqNum=2 for row ddd 
> with hfile group 
> [{[B@7173d25a,hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_1}]
> {noformat}
> There are two files to commit, but after this we see:
> {noformat}
> 2015-02-11 02:01:47,327 INFO  
> [B.defaultRpcServer.handler=3,queue=0,port=41003] regionserver.HStore(690): 
> Validating hfile at 
> hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_0
>  for inclusion in store myfam region 
> bulkNS:mytable_testSimpleLoad,,1423620104753.fdcbd21e43683c753bae40f1d890daa6.
> 2015-02-11 02:01:47,330 INFO  
> [B.defaultRpcServer.handler=1,queue=0,port=41003] regionserver.HStore(690): 
> Validating hfile at 
> hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_1
>  for inclusion in store myfam region 
> bulkNS:mytable_testSimpleLoad,ddd,1423620104753.ec757ff718ce8ab99f4f6bcca389d67f.
> 2015-02-11 02:01:47,330 INFO  
> [B.defaultRpcServer.handler=4,queue=0,port=41003] regionserver.HStore(690): 
> Validating hfile at 
> hdfs://localhost:59736/user/jenkins/test-data/d964a632-8db5-4f3a-966f-89746947294b/testSimpleLoad/myfam/hfile_1
>  for inclusion in store myfam region 
> bulkNS:mytable_testSimpleLoad,ddd,1423620104753.ec757ff718ce8ab99f4f6bcca389d67f.
> {noformat}
> We can see that hfile_1 has been committed twice; the second call fails and 
> causes the test to time out.
> I'm not sure whether this is an issue in AsyncRpcClient, but if I use 
> RpcClientImpl, the test always passes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
