[
https://issues.apache.org/jira/browse/ACCUMULO-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882936#comment-13882936
]
Josh Elser edited comment on ACCUMULO-2261 at 1/27/14 4:39 PM:
---------------------------------------------------------------
We have ntp setup to synchronize system clocks. Here's the log from
192.168.2.233.
_applied jira formatting_
{noformat}
2014-01-25 09:49:11,708 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=483.68814MB (507183784), Free=1564.3119MB (1640299864), Max=2048.0MB
(2147483648), Counts: Blocks=835, Access=30151167, Hit=30150280,
Miss=887, Evictions=0, Evicted=0, Ratios: Hit Ratio=99.9970555305481%, Miss
Ratio=0.0029418428312055767%, Evicted/Run=NaN, Duplicate Reads=18
2014-01-25 09:49:11,711 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=6742.9854MB (7070532384), Free=1449.0149MB (1519402208), Max=8192.0MB
(8589934592), Counts: Blocks=68811, Access=43909183, Hit=4336353
1, Miss=545652, Evictions=57, Evicted=476838, Ratios: Hit
Ratio=98.75731468200684%, Miss Ratio=1.242683082818985%,
Evicted/Run=8365.5791015625, Duplicate Reads=0
2014-01-25 09:49:11,712 [tabletserver.TabletServer] DEBUG: gc
ParNew=897.13(+17.22) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=4,064,834,072(+1,745,404,552) totalmem=12,777,553,920
2014-01-25 09:49:17,455 [tabletserver.TabletServer] DEBUG: gc
ParNew=901.79(+4.67) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=3,913,871,704(+667,572,648) totalmem=12,777,553,920
2014-01-25 09:49:17,871 [cache.LruBlockCache] DEBUG: Block cache LRU eviction
started. Attempting to free 859003704 bytes
2014-01-25 09:49:17,942 [cache.LruBlockCache] DEBUG: Block cache LRU eviction
completed. Freed 859019280 bytes. Priority Sizes: Single=2512.0676MB
(2634093856), Multi=4456.495MB (4672973712),Memory=0.0MB (0)
2014-01-25 09:49:19,569 [tabletserver.TabletServer] DEBUG: gc
ParNew=903.09(+1.30) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=3,733,299,800(-180,571,904) totalmem=12,777,553,920
2014-01-25 09:49:21,472 [tabletserver.TabletServer] DEBUG: gc
ParNew=904.08(+0.99) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=3,574,451,336(-158,848,464) totalmem=12,777,553,920
2014-01-25 09:49:44,408 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO : Loading tablet
d;72~gcm~201304;72
2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO :
dcloud18/192.168.2.233:9997: got assignment from master: d;72~gcm~201304;72
2014-01-25 09:49:44,408 [tabletserver.TabletServer] DEBUG: Loading extent:
d;72~gcm~201304;72
2014-01-25 09:49:44,409 [tabletserver.TabletServer] DEBUG: verifying extent
d;72~gcm~201304;72
2014-01-25 09:49:44,418 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,418 [tabletserver.TabletServer] INFO : Loading tablet
p;18~thm~2012101;18
2014-01-25 09:49:44,456 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,557 [tabletserver.Tablet] DEBUG: Looking at metadata
{d;72~gcm~201304 file:/b-00028xn/I0002i57.rf [] 40757550
false=1625685,0,1389849852187, d;72~gcm~201304 file:/b-00028xn/I0002i5f.rf []
40757699 false=1592695,0,1389849860660, d;72~gcm~201304
file:/t-0001ect/C0001nrv.rf [] 40418333 false=178655073,3168593,
d;72~gcm~201304 file:/t-0001rzt/C0001u7r.rf [] 40441613
false=232169298,4085196, d;72~gcm~201304 file:/t-0001rzt/C0002evg.rf []
40708788 false=281016452,4939001, d;72~gcm~201304 file:/t-0001rzt/C0002j3b.rf
[] 40750189 false=36970878,670358, d;72~gcm~201304 file:/t-0001rzt/C0002j5f.rf
[] 40756562 false=22086620,394464, d;72~gcm~201304 future:343bc1fa155242c []
41048300 false=192.168.2.233:9997, d;72~gcm~201304 last:3438da6a936264c []
40756562 false=192.168.2.225:9997, d;72~gcm~201304 srv:dir [] 40418333
false=/t-0001rzt, d;72~gcm~201304 srv:flush [] 40825738 false=9,
d;72~gcm~201304 srv:lock [] 40825738
false=tservers/192.168.2.223:9997/zlock-0000000001$243b1084d9e1273,
d;72~gcm~201304 srv:time [] 40757699 false=M1389849860660, d;72~gcm~201304
~tab:~pr [] 40418333 false=^A72}
2014-01-25 09:49:44,557 [tabletserver.Tablet] DEBUG: got [] for logs for
d;72~gcm~201304;72
2014-01-25 09:49:44,557 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,557 [zookeeper.ZooLock] DEBUG: Unexpected event watching
lock node WatchedEvent state:Disconnected type:None path:null
/accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/tservers/192.168.2.233:9997/zlock-0000000000
2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: gc
ParNew=926.22(+22.14) secs ConcurrentMarkSweep=6.25(+0.03) secs
freemem=3,477,953,072(-96,498,264) totalmem=12,777,553,920
2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:50:09,984 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=6630.77MB (6952866144), Free=1561.2301MB (1637068448), Max=8192.0MB
(8589934592), Counts: Blocks=67662, Access=43916676, Hit=43363810, Miss=552866,
Evictions=58, Evicted=485201, Ratios: Hit Ratio=98.74109625816345%, Miss
Ratio=1.2588976882398129%, Evicted/Run=8365.5341796875, Duplicate Reads=0
2014-01-25 09:50:09,984 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=513.6937MB (538646880), Free=1534.3063MB (1608836768), Max=2048.0MB
(2147483648), Counts: Blocks=845, Access=30157409, Hit=30156512, Miss=897,
Evictions=0, Evicted=0, Ratios: Hit Ratio=99.99703168869019%, Miss
Ratio=0.0029743934646830894%, Evicted/Run=NaN, Duplicate Reads=18
2014-01-25 09:50:28,730 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for
/accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/config/tserver.client.timeout
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
{noformat}
was (Author: anthonyf):
We have ntp setup to synchronize system clocks. Here's the log from
192.168.2.233.
2014-01-25 09:49:11,708 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=483.68814MB (507183784), Free=1564.3119MB (1640299864), Max=2048.0MB
(2147483648), Counts: Blocks=835, Access=30151167, Hit=30150280,
Miss=887, Evictions=0, Evicted=0, Ratios: Hit Ratio=99.9970555305481%, Miss
Ratio=0.0029418428312055767%, Evicted/Run=NaN, Duplicate Reads=18
2014-01-25 09:49:11,711 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=6742.9854MB (7070532384), Free=1449.0149MB (1519402208), Max=8192.0MB
(8589934592), Counts: Blocks=68811, Access=43909183, Hit=4336353
1, Miss=545652, Evictions=57, Evicted=476838, Ratios: Hit
Ratio=98.75731468200684%, Miss Ratio=1.242683082818985%,
Evicted/Run=8365.5791015625, Duplicate Reads=0
2014-01-25 09:49:11,712 [tabletserver.TabletServer] DEBUG: gc
ParNew=897.13(+17.22) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=4,064,834,072(+1,745,404,552) totalmem=12,777,553,920
2014-01-25 09:49:17,455 [tabletserver.TabletServer] DEBUG: gc
ParNew=901.79(+4.67) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=3,913,871,704(+667,572,648) totalmem=12,777,553,920
2014-01-25 09:49:17,871 [cache.LruBlockCache] DEBUG: Block cache LRU eviction
started. Attempting to free 859003704 bytes
2014-01-25 09:49:17,942 [cache.LruBlockCache] DEBUG: Block cache LRU eviction
completed. Freed 859019280 bytes. Priority Sizes: Single=2512.0676MB
(2634093856), Multi=4456.495MB (4672973712),Memory=0.0MB (0)
2014-01-25 09:49:19,569 [tabletserver.TabletServer] DEBUG: gc
ParNew=903.09(+1.30) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=3,733,299,800(-180,571,904) totalmem=12,777,553,920
2014-01-25 09:49:21,472 [tabletserver.TabletServer] DEBUG: gc
ParNew=904.08(+0.99) secs ConcurrentMarkSweep=6.22(+0.00) secs
freemem=3,574,451,336(-158,848,464) totalmem=12,777,553,920
2014-01-25 09:49:44,408 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO : Loading tablet
d;72~gcm~201304;72
2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO :
dcloud18/192.168.2.233:9997: got assignment from master: d;72~gcm~201304;72
2014-01-25 09:49:44,408 [tabletserver.TabletServer] DEBUG: Loading extent:
d;72~gcm~201304;72
2014-01-25 09:49:44,409 [tabletserver.TabletServer] DEBUG: verifying extent
d;72~gcm~201304;72
2014-01-25 09:49:44,418 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,418 [tabletserver.TabletServer] INFO : Loading tablet
p;18~thm~2012101;18
2014-01-25 09:49:44,456 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,557 [tabletserver.Tablet] DEBUG: Looking at metadata
{d;72~gcm~201304 file:/b-00028xn/I0002i57.rf [] 40757550
false=1625685,0,1389849852187, d;72~gcm~201304 file:/b-00028xn/I0002i5f.rf []
40757699 false=1592695,0,1389849860660, d;72~gcm~201304
file:/t-0001ect/C0001nrv.rf [] 40418333 false=178655073,3168593,
d;72~gcm~201304 file:/t-0001rzt/C0001u7r.rf [] 40441613
false=232169298,4085196, d;72~gcm~201304 file:/t-0001rzt/C0002evg.rf []
40708788 false=281016452,4939001, d;72~gcm~201304 file:/t-0001rzt/C0002j3b.rf
[] 40750189 false=36970878,670358, d;72~gcm~201304 file:/t-0001rzt/C0002j5f.rf
[] 40756562 false=22086620,394464, d;72~gcm~201304 future:343bc1fa155242c []
41048300 false=192.168.2.233:9997, d;72~gcm~201304 last:3438da6a936264c []
40756562 false=192.168.2.225:9997, d;72~gcm~201304 srv:dir [] 40418333
false=/t-0001rzt, d;72~gcm~201304 srv:flush [] 40825738 false=9,
d;72~gcm~201304 srv:lock [] 40825738
false=tservers/192.168.2.223:9997/zlock-0000000001$243b1084d9e1273,
d;72~gcm~201304 srv:time [] 40757699 false=M1389849860660, d;72~gcm~201304
~tab:~pr [] 40418333 false=^A72}
2014-01-25 09:49:44,557 [tabletserver.Tablet] DEBUG: got [] for logs for
d;72~gcm~201304;72
2014-01-25 09:49:44,557 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:49:44,557 [zookeeper.ZooLock] DEBUG: Unexpected event watching
lock node WatchedEvent state:Disconnected type:None path:null
/accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/tservers/192.168.2.233:9997/zlock-0000000000
2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: gc
ParNew=926.22(+22.14) secs ConcurrentMarkSweep=6.25(+0.03) secs
freemem=3,477,953,072(-96,498,264) totalmem=12,777,553,920
2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: Got loadTablet
message from user: !SYSTEM
2014-01-25 09:50:09,984 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=6630.77MB (6952866144), Free=1561.2301MB (1637068448), Max=8192.0MB
(8589934592), Counts: Blocks=67662, Access=43916676, Hit=43363810, Miss=552866,
Evictions=58, Evicted=485201, Ratios: Hit Ratio=98.74109625816345%, Miss
Ratio=1.2588976882398129%, Evicted/Run=8365.5341796875, Duplicate Reads=0
2014-01-25 09:50:09,984 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=513.6937MB (538646880), Free=1534.3063MB (1608836768), Max=2048.0MB
(2147483648), Counts: Blocks=845, Access=30157409, Hit=30156512, Miss=897,
Evictions=0, Evicted=0, Ratios: Hit Ratio=99.99703168869019%, Miss
Ratio=0.0029743934646830894%, Evicted/Run=NaN, Duplicate Reads=18
2014-01-25 09:50:28,730 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss for
/accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/config/tserver.client.timeout
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
> duplicate locations
> -------------------
>
> Key: ACCUMULO-2261
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2261
> Project: Accumulo
> Issue Type: Bug
> Components: master, tserver
> Affects Versions: 1.5.0
> Environment: hadoop 2.2.0 and zookeeper 3.4.5
> Reporter: Eric Newton
> Assignee: Eric Newton
> Priority: Blocker
> Fix For: 1.5.1
>
>
> Anthony F reports the following:
> bq. I have observed a loss of data when tservers fail during bulk ingest.
> The keys that are missing are right around the table's splits indicating that
> data was lost when a tserver died during a split. I am using Accumulo 1.5.0.
> At around the same time, I observe the master logging a message about "Found
> two locations for the same extent".
> And:
> bq. I'm currently digging through the logs and will report back. Keep in
> mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack. To determine data
> loss, I have a 'ConsistencyCheckingIterator' that verifies each row has the
> expected data (it takes a long time to scan the whole table). Below is a
> quick summary of what happened. The tablet in question is "d;72~gcm~201304".
> Notice that it is assigned to 192.168.2.233:9997[343bc1fa155242c] at
> 2014-01-25 09:49:36,233. At 2014-01-25 09:49:54,141, the tserver goes away.
> Then, the tablet gets assigned to 192.168.2.223:9997[143bc1f14412432] and
> shortly after that, I see the BadLocationStateException. The master never
> recovers from the BLSE - I have to manually delete one of the offending
> locations.
> {noformat}
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
> tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning
> tablet p;18~thm~2012101;18=192.168.2.233:9997[343bc1fa155242c]
> 2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers
> [192.168.2.233:9997[343bc1fa155242c]]
> 2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers:
> [d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
>
> d;06~u36~2013;06~thm~2012083@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;25;24~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;25~u36~201303;25~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;27~gcm~2013041;27@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;30~u36~2013031;30~thm~2012082@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;34~thm;34~gcm~2013022@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;39~thm~20121;39~gcm~20130418@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;41~thm;41~gcm~2013041@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;42~u36~201304;42~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;45~thm~201208;45~gcm~201303@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;48~gcm~2013052;48@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;60~u36~2013021;60~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;68~gcm~2013041;68@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;72;71~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null),
> d;75~thm~2012101;75~gcm~2013032@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;78;77~u36~201305@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;90~u36~2013032;90~thm~2012092@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;91~thm;91~gcm~201304@(null,192.168.2.233:9997[343bc1fa155242c],null),
> d;93~u36~2013012;93~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;20;19@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;38;37@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;51;50@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;60;59@(null,192.168.2.233:9997[343bc1fa155242c],null),
> m;92;91@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;01<@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;04;03@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;50;49@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;63;62@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;74;73@(null,192.168.2.233:9997[343bc1fa155242c],null),
> o;97;96@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;08~thm~20121;08@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;09~thm~20121;09@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;10;09~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null),
> p;21;20~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;22~thm~2012091;22@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;23;22~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;41~thm~2012111;41@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;42;41~thm~2012111@(null,192.168.2.233:9997[343bc1fa155242c],null),
> p;58~thm~201208;58@(null,192.168.2.233:9997[343bc1fa155242c],null)]...
> 2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning
> tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
> 2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet
> d;72~gcm~201304;72 was loaded on 192.168.2.223:9997
> 2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR:
> java.lang.RuntimeException:
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
> found two locations for the same extent d;72~gcm~201304:
> 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> java.lang.RuntimeException:
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
> found two locations for the same extent d;72~gcm~201304:
> 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)