[ 
https://issues.apache.org/jira/browse/ACCUMULO-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884354#comment-13884354
 ] 

Keith Turner commented on ACCUMULO-2261:
----------------------------------------

[~elserj]

bq. Right, I was just making the distinction between making the check in the 
AssignmentHandler and in the MetadataConstraints.

Doing the check in the metadata constraint means it will be done later.  If 
the assignment handler does the check and then issues a metadata table update, 
the actual update could take place much later (even after the requesting 
tserver has died).  It's also possible that the lock could be lost when doing 
the check in the constraint.  The check in the constraint is a sanity check, 
but it does not prevent all problems.  Doing the sanity check as late as 
possible is better.  Note that even if zoocache is not used, there is still a 
race condition.  If the lock is lost after zookeeper is checked directly, then 
it's a problem.  Using conditional mutations (and CAS in zookeeper) is a better 
solution.  It allows us to make the mutation only if the future and current 
locations are in the expected state.
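
To illustrate the compare-and-set idea described above, here is a minimal 
in-memory sketch.  It is not Accumulo's API: the LocationTable and Locations 
classes and the compareAndSet method are hypothetical names, and a 
ConcurrentHashMap stands in for the metadata table.  The point it tries to 
show is that the check and the write happen as one atomic step, so a delayed 
update from a dead tserver is rejected instead of producing a second location.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory stand-in for the metadata table (not Accumulo code). */
class LocationTable {

    /** Future and current location of one tablet; null means "not set". */
    static final class Locations {
        final String future;
        final String current;

        Locations(String future, String current) {
            this.future = future;
            this.current = current;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Locations)) {
                return false;
            }
            Locations other = (Locations) o;
            return Objects.equals(future, other.future)
                && Objects.equals(current, other.current);
        }

        @Override
        public int hashCode() {
            return Objects.hash(future, current);
        }
    }

    private static final Locations UNASSIGNED = new Locations(null, null);

    private final ConcurrentMap<String, Locations> rows = new ConcurrentHashMap<>();

    /**
     * Writes {@code update} only if the stored state still equals {@code expected}.
     * The comparison and the write are a single atomic operation, so there is no
     * window between "check" and "update" in which the state can change.
     */
    boolean compareAndSet(String extent, Locations expected, Locations update) {
        rows.putIfAbsent(extent, UNASSIGNED);
        return rows.replace(extent, expected, update);
    }

    /** Returns the stored state, or the unassigned state if the row is absent. */
    Locations get(String extent) {
        return rows.getOrDefault(extent, UNASSIGNED);
    }
}
```

In the scenario from this ticket, the first tserver would try to move the 
tablet from (future=A, current=null) to (future=null, current=A).  If the 
master has already reassigned the tablet to a second tserver by the time that 
update arrives, the expected state no longer matches, compareAndSet returns 
false, and only one location is ever recorded.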

[~ecn] I don't think we should make 1.7 use conditional mutations for this 
case.  I think 1.7 should have a ticket to make one big change so that all 
metadata operations use CAS w/ zookeeper and the metadata table in places where 
it makes sense.  Seems like it would be nice to do this in a branch, work out 
the kinks, and then merge it in.  I will open a ticket.  This is something I 
was thinking of doing for 1.6, but never got around to it.



> duplicate locations
> -------------------
>
>                 Key: ACCUMULO-2261
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2261
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.5.0
>         Environment: hadoop 2.2.0 and zookeeper 3.4.5
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>             Fix For: 1.5.1
>
>
> Anthony F reports the following:
> bq. I have observed a loss of data when tservers fail during bulk ingest.  
> The keys that are missing are right around the table's splits indicating that 
> data was lost when a tserver died during a split.  I am using Accumulo 1.5.0. 
>  At around the same time, I observe the master logging a message about "Found 
> two locations for the same extent". 
> And:
> bq.  I'm currently digging through the logs and will report back.  Keep in 
> mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack.  To determine data 
> loss, I have a 'ConsistencyCheckingIterator' that verifies each row has the 
> expected data (it takes a long time to scan the whole table).  Below is a 
> quick summary of what happened.  The tablet in question is "d;72~gcm~201304". 
>  Notice that it is assigned to 192.168.2.233:9997[343bc1fa155242c] at 
> 2014-01-25 09:49:36,233.  At 2014-01-25 09:49:54,141, the tserver goes away.  
> Then, the tablet gets assigned to 192.168.2.223:9997[143bc1f14412432] and 
> shortly after that, I see the BadLocationStateException.  The master never 
> recovers from the BLSE - I have to manually delete one of the offending 
> locations.
> {noformat}
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning 
> tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning 
> tablet p;18~thm~2012101;18=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers 
> [192.168.2.233:9997[343bc1fa155242c]]
> 2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers: 
> [d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  
> d;06~u36~2013;06~thm~2012083@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;25;24~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;25~u36~201303;25~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;27~gcm~2013041;27@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;30~u36~2013031;30~thm~2012082@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;34~thm;34~gcm~2013022@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;39~thm~20121;39~gcm~20130418@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;41~thm;41~gcm~2013041@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;42~u36~201304;42~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;45~thm~201208;45~gcm~201303@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;48~gcm~2013052;48@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;60~u36~2013021;60~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;68~gcm~2013041;68@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;72;71~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null), 
> d;75~thm~2012101;75~gcm~2013032@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;78;77~u36~201305@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;90~u36~2013032;90~thm~2012092@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  d;91~thm;91~gcm~201304@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> d;93~u36~2013012;93~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
>  m;20;19@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> m;38;37@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> m;51;50@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> m;60;59@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> m;92;91@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> o;01<@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> o;04;03@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> o;50;49@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> o;63;62@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> o;74;73@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> o;97;96@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;08~thm~20121;08@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;09~thm~20121;09@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;10;09~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null), 
> p;21;20~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;22~thm~2012091;22@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;23;22~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;41~thm~2012111;41@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;42;41~thm~2012111@(null,192.168.2.233:9997[343bc1fa155242c],null), 
> p;58~thm~201208;58@(null,192.168.2.233:9997[343bc1fa155242c],null)]...
> 2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning 
> tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
> 2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet 
> d;72~gcm~201304;72 was loaded on 192.168.2.223:9997
> 2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR: 
> java.lang.RuntimeException: 
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>  found two locations for the same extent d;72~gcm~201304: 
> 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> java.lang.RuntimeException: 
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>  found two locations for the same extent d;72~gcm~201304: 
> 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
