[ https://issues.apache.org/jira/browse/ACCUMULO-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882847#comment-13882847 ]
Eric Newton commented on ACCUMULO-2261:
---------------------------------------
In this case, a tablet was assigned to server A, and that server was
considered down before the tablet was loaded. The tablet was then assigned to server
B, but both server A and server B updated the location information. Both servers
are running on the same computer (since they have the same IP).
There are several mechanisms in place to avoid this problem:
# Port number: two processes should not be able to open the same address on a
Unix-like system.
# ZooKeeper lock: server B should not have been able to get the lock while
server A still held it.
# !METADATA constraint: updates to the !METADATA table (which holds the assigned
and current locations) are protected by a constraint that verifies the writing
server still holds its lock (see the sketch after this list).
# Assignment check: the tablet server verifies that the tablet is assigned to it
before loading it.
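To make items 2 and 3 concrete, here is a minimal sketch of the kind of ZooKeeper check involved: a server counts as the lock holder only while its ephemeral-sequential node still exists and is first in line. This is not Accumulo's actual lock code, and the lock parent path is hypothetical.
{noformat}
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class LockCheckSketch {
  // Hypothetical lock parent path; Accumulo's real ZooKeeper layout differs.
  static final String LOCK_PARENT = "/tservers/192.168.2.233:9997";

  // A server "still holds its lock" only if its ephemeral-sequential node
  // still exists and has the lowest sequence number under the lock parent.
  static boolean stillHoldsLock(ZooKeeper zk, String lockNode) throws Exception {
    List<String> children = zk.getChildren(LOCK_PARENT, false);
    if (!children.contains(lockNode))
      return false;                 // session expired: the ephemeral node is gone
    Collections.sort(children);     // lowest sequence number is the lock holder
    return children.get(0).equals(lockNode);
  }
}
{noformat}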
Decoding the write-ahead logs of the !METADATA tablet would give us a clearer
idea of the order in which things happened. If this happens again, please copy the
!METADATA walog files for post-analysis. Look for these entries:
{noformat}
shell> scan -b !0; -e !0< -c log
{noformat}
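If scripting is easier than the shell, roughly the same scan can be done through the Java client API. The sketch below assumes the 1.5 client API; the instance name, ZooKeeper hosts, and credentials are placeholders.
{noformat}
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class DumpMetadataLogEntries {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details -- substitute your instance/zookeepers/user.
    Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Same range and column as the shell command: scan -b !0; -e !0< -c log
    Scanner s = conn.createScanner("!METADATA", Authorizations.EMPTY);
    s.setRange(new Range("!0;", "!0<"));
    s.fetchColumnFamily(new Text("log"));

    for (Entry<Key,Value> e : s)
      System.out.println(e.getKey() + " -> " + e.getValue());
  }
}
{noformat}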
For this to have happened, the master needed to see server A's ZooKeeper lock
expire and read the old "assigned" status, server B had to start and be
noticed by the master, the old server had to write a last-gasp update to the
!METADATA table, and the server serving the !METADATA table had to read a
cached, stale lock from ZooKeeper.
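As a toy illustration of that last step (not Accumulo's constraint code), a cached lock lookup can accept a last-gasp write from a server whose ZooKeeper session has already expired; the LockSource interface and the cache here are made up for the example.
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of a lock-verifying check that caches ZooKeeper lookups.
// A positive cache entry that outlives the real session lets a stale
// writer's last-gasp update through.
public class CachedLockCheckSketch {
  interface LockSource {                      // stands in for a live ZooKeeper query
    boolean lockExists(String serverLock);
  }

  private final LockSource zk;
  private final ConcurrentMap<String,Boolean> cache = new ConcurrentHashMap<>();

  CachedLockCheckSketch(LockSource zk) { this.zk = zk; }

  // Accepts the update if the writer "holds" its lock -- but a cache hit
  // never re-consults ZooKeeper, which is the window where the expired
  // server's update can still be accepted.
  boolean acceptUpdate(String serverLock) {
    return cache.computeIfAbsent(serverLock, zk::lockExists);
  }
}
{noformat}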
Are you automatically restarting your tservers, by any chance?
> duplicate locations
> -------------------
>
> Key: ACCUMULO-2261
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2261
> Project: Accumulo
> Issue Type: Bug
> Components: master, tserver
> Affects Versions: 1.5.0
> Environment: hadoop 2.2.0 and zookeeper 3.4.5
> Reporter: Eric Newton
> Assignee: Eric Newton
> Priority: Blocker
> Fix For: 1.5.1
>
>
> Anthony F reports the following:
> bq. I have observed a loss of data when tservers fail during bulk ingest.
> The keys that are missing are right around the table's splits, indicating that
> data was lost when a tserver died during a split. I am using Accumulo 1.5.0.
> At around the same time, I observe the master logging a message about "Found
> two locations for the same extent".
> And:
> bq. I'm currently digging through the logs and will report back. Keep in
> mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack. To determine data
> loss, I have a 'ConsistencyCheckingIterator' that verifies each row has the
> expected data (it takes a long time to scan the whole table). Below is a
> quick summary of what happened. The tablet in question is "d;72~gcm~201304".
> Notice that it is assigned to 192.168.2.233:9997[343bc1fa155242c] at
> 2014-01-25 09:49:36,233. At 2014-01-25 09:49:54,141, the tserver goes away.
> Then, the tablet gets assigned to 192.168.2.223:9997[143bc1f14412432] and
> shortly after that, I see the BadLocationStateException. The master never
> recovers from the BLSE - I have to manually delete one of the offending
> locations.
> {noformat}
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
> tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
> tablet p;18~thm~2012101;18=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers
> [192.168.2.233:9997[343bc1fa155242c]]
> 2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers:
> [d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;06~u36~2013;06~thm~2012083@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;25;24~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;25~u36~201303;25~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;27~gcm~2013041;27@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;30~u36~2013031;30~thm~2012082@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;34~thm;34~gcm~2013022@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;39~thm~20121;39~gcm~20130418@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;41~thm;41~gcm~2013041@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;42~u36~201304;42~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;45~thm~201208;45~gcm~201303@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;48~gcm~2013052;48@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;60~u36~2013021;60~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;68~gcm~2013041;68@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;72;71~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null),
> d;75~thm~2012101;75~gcm~2013032@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;78;77~u36~201305@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;90~u36~2013032;90~thm~2012092@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;91~thm;91~gcm~201304@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;93~u36~2013012;93~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;20;19@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;38;37@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;51;50@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;60;59@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;92;91@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;01<@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;04;03@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;50;49@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;63;62@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;74;73@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;97;96@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;08~thm~20121;08@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;09~thm~20121;09@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;10;09~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null),
> p;21;20~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;22~thm~2012091;22@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;23;22~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;41~thm~2012111;41@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;42;41~thm~2012111@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;58~thm~201208;58@(null,192.168.2.233:9997[343bc1fa155242c],null)]...
> 2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning
> tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
> 2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet
> d;72~gcm~201304;72 was loaded on 192.168.2.223:9997
> 2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR:
> java.lang.RuntimeException:
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
> found two locations for the same extent d;72~gcm~201304:
> 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> java.lang.RuntimeException:
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
> found two locations for the same extent d;72~gcm~201304:
> 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> {noformat}