[ https://issues.apache.org/jira/browse/ACCUMULO-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634471#comment-15634471 ]
Adam J Shook commented on ACCUMULO-4506:
----------------------------------------

There are two znodes under the {{locks}} node, one for each file. They belong to different tservers, which I identified by matching the {{ephemeralOwner}} of each lock znode against the nodes under {{tservers}}:

{noformat}
[zk: host:2181(CONNECTED) 9] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/replication/workqueue/locks/ae4b03ec-159b-44e8-9a88-ccf7fa849c19|peer_instance|5h|k
ephemeralOwner = 0x357d1bf618f80ad

[zk: host:2181(CONNECTED) 14] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/replication/workqueue/locks/9f038f64-4252-44a0-bfd0-99d4a316b397|peer_instance|5g|j
ephemeralOwner = 0x357d1bf618f4f72

[zk: host:2181(CONNECTED) 12] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/tservers/host:31658/zlock-0000000000
TSERV_CLIENT=host:31658
ephemeralOwner = 0x357d1bf618f80ad

[zk: host:2181(CONNECTED) 13] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/tservers/host:31368/zlock-0000000000
TSERV_CLIENT=host:31368
ephemeralOwner = 0x357d1bf618f4f72
{noformat}

Unfortunately, we don't keep logs around long enough to see when these files were initially assigned. We only have data back to October 27th -- a Kibana search for the WAL UUIDs only returns log entries from the Master and the GC.

For what it's worth, we've been trying out replication and are seeing some behavior we can't really explain without digging into it a lot more (source code included). The time between a WAL file being closed and it actually being replicated is much longer than I would expect -- anywhere from five minutes to a couple of hours. I see a lot of log entries saying work is being scheduled, but it takes a while to see the work being done. This particular cluster has four tablet servers, and there are always 40-60 files pending replication, with files rarely "in-progress" (besides these two problematic files).
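The cross-check above can be sketched as a small script. This is purely a hypothetical helper with the session IDs hardcoded from the zkCli output above, not anything shipped with Accumulo; in practice the two dicts would be populated from a ZooKeeper client's stat for each znode:

```python
# Match each replication work-queue lock to the tserver whose ZooKeeper
# session created it: an ephemeral znode's ephemeralOwner is the session
# ID of the client that created it, so two ephemeral znodes with the same
# ephemeralOwner belong to the same process.

# ephemeralOwner of each znode under .../replication/workqueue/locks
lock_owners = {
    "ae4b03ec-159b-44e8-9a88-ccf7fa849c19|peer_instance|5h|k": 0x357d1bf618f80ad,
    "9f038f64-4252-44a0-bfd0-99d4a316b397|peer_instance|5g|j": 0x357d1bf618f4f72,
}

# ephemeralOwner of each tserver's zlock znode under .../tservers
tserver_sessions = {
    "host:31658": 0x357d1bf618f80ad,
    "host:31368": 0x357d1bf618f4f72,
}

def lock_holder(lock, lock_owners, tserver_sessions):
    """Return the tserver holding `lock`, or None if no live session matches."""
    session = lock_owners[lock]
    for tserver, owner in tserver_sessions.items():
        if owner == session:
            return tserver
    return None

for lock in lock_owners:
    wal_uuid = lock.split("|")[0]
    print(wal_uuid, "->", lock_holder(lock, lock_owners, tserver_sessions))
```

A lock whose `ephemeralOwner` matches no entry under {{tservers}} would indicate an orphaned lock from a dead session; here both locks map to live tservers, which is what makes the stuck files surprising.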
It seems to replicate in waves, and I haven't put my finger on when files move from pending to in-progress. With that said, things *are* replicating; it is just taking a while longer than we anticipated. I'm not sure whether this is expected behavior or something else is going on.

> Some in-progress files for replication never replicate
> -------------------------------------------------------
>
>                 Key: ACCUMULO-4506
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4506
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 1.7.2
>            Reporter: Adam J Shook
>
> We're seeing an issue with replication where two files have been in-progress
> for a long time and based on the logs are not going to be replicated. The
> metadata from the {{accumulo.replication}} table looks a little funky, with a
> very large {{begin}} value.
>
> *Logs*
> {noformat}
> 2016-11-02 19:52:50,900 [replication.DistributedWorkQueueWorkAssigner] DEBUG: Not queueing work for hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 to Remote Name: peer_instance Remote identifier: 5h Source Table ID: k because [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365827] doesn't need replication
> 2016-11-02 19:53:08,900 [replication.DistributedWorkQueueWorkAssigner] DEBUG: Not queueing work for hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 to Remote Name: peer_instance Remote identifier: 5i Source Table ID: l because [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052816174] doesn't need replication
> {noformat}
>
> *Replication table*
> {noformat}
> scan -r hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 -t accumulo.replication
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 repl:j [] [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477314369633]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 repl:k [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365827]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 repl:l [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365707]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025g\x01\x00\x00\x00\x01j [] [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477314369633]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025h\x01\x00\x00\x00\x01k [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365827]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025i\x01\x00\x00\x00\x01l [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365707]
>
> scan -r hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 -t accumulo.replication
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 repl:j [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052819752]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 repl:k [] [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477052816238]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 repl:l [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052816174]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025g\x01\x00\x00\x00\x01j [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052819752]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025h\x01\x00\x00\x00\x01k [] [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477052816238]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025i\x01\x00\x00\x00\x01l [] [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052816174]
> {noformat}
>
> *HDFS*
> {noformat}
> hdfs dfs -ls hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19
> -rwxr-xr-x   3 ubuntu supergroup 1117650900 2016-10-24 13:09 hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397
> -rwxr-xr-x   3 ubuntu supergroup 1171968390 2016-10-21 12:31 hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
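For context on why the assigner logs "doesn't need replication": {{begin: 9223372036854775807}} is Java's {{Long.MAX_VALUE}}, the sentinel for "everything up to the (infinite) end has been replicated". The following is only a rough model of that check as I read the logs and table dump above, not the actual Accumulo code:

```python
# Java Long.MAX_VALUE, the "begin" value in the two stuck records above.
LONG_MAX = 2**63 - 1

def needs_replication(begin, end, infinite_end):
    """Rough model of the assigner's check: the byte range [begin, end)
    (or [begin, infinity) when infinite_end is set) still has data to
    replicate only while begin has not caught up to the end."""
    if infinite_end:
        # An open-ended range is exhausted only once begin hits Long.MAX_VALUE.
        return begin < LONG_MAX
    return begin < end

# Healthy closed-but-unreplicated record (e.g. the repl:j row above):
print(needs_replication(0, 0, True))         # True  -> work gets queued
# The stuck records: begin is already Long.MAX_VALUE, so they are treated
# as fully replicated and skipped, even though the WALs still sit in HDFS:
print(needs_replication(LONG_MAX, 0, True))  # False -> "doesn't need replication"
```

Under this model, the oddity in the table dump is that {{begin}} jumped straight to {{Long.MAX_VALUE}} for records whose data was apparently never shipped, so the assigner permanently skips them.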