Awesome hunting and write-up, Dylan!
Dylan Hutchison wrote:
Hi all,
Check out ACCUMULO-4229
<https://issues.apache.org/jira/browse/ACCUMULO-4229>. I've copied the text
below to include it in our discussion.
This issue came up again during a similar long-running BatchWriter session. It had
the same circumstances: the failures started happening after a split event.
I turned on TRACE logs and I think I pinned it down: the TabletLocator
cached by a BatchWriter gets out of sync with the static cache of
TabletLocators.
1. The TabletServerBatchWriter caches a TabletLocator from the static
collection of TabletLocators when it starts writing. Suppose it is writing
to tablet T1.
2. The TabletServerBatchWriter uses its locally cached TabletLocator
inside its `binMutations` method for its entire lifespan; this cache is
never refreshed or updated to sync up with the static collection of
TabletLocators.
3. Every hour, the static collection of TabletLocators clears itself.
The next call to get a TabletLocator from the static collection allocates a
new TabletLocator. Unfortunately, the TabletServerBatchWriter does not
reflect this change and continues to use the old, locally cached
TabletLocator.
4. Tablet T1 splits into T2 and T3, which closes T1. As such, T1 no
longer exists, and the entries the tablet server receives that were meant
for T1 all fail to write because T1 is closed.
5. The TabletServerBatchWriter receives the response from the tablet
server that all entries failed to write. It invalidates the cache of the
*new* TabletLocator obtained from the static collection of TabletLocators.
The old TabletLocator that is cached locally does not get invalidated.
6. The TabletServerBatchWriter re-queues the failed entries and tries to
write them to the same closed tablet T1, because it is still looking up
tablets using the old TabletLocator.
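The lifecycle above can be boiled down to a small sketch. This is illustrative only; the class and method names are hypothetical stand-ins for the real TabletLocator machinery:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the desync: a writer grabs a locator from a static cache once,
// the static cache is later cleared, and invalidations then hit a *new*
// locator instance while the writer keeps using the stale one.
public class StaleLocatorSketch {
    // Stands in for the static collection of TabletLocators.
    static final Map<String, Locator> LOCATORS = new HashMap<>();

    static class Locator {
        boolean invalidated = false;
        void invalidateCache() { invalidated = true; }
    }

    static Locator getLocator(String table) {
        return LOCATORS.computeIfAbsent(table, t -> new Locator());
    }

    public static void main(String[] args) {
        // Step 1: the writer caches a locator locally at construction time.
        Locator writersCopy = getLocator("vb");

        // Step 3: the hourly clear wipes the static collection.
        LOCATORS.clear();

        // Step 5: a failure invalidates the locator fetched *now*, which is
        // a brand-new instance, not the one the writer still holds.
        getLocator("vb").invalidateCache();

        // Step 6: the writer's stale copy was never invalidated, so it
        // keeps binning mutations to the closed tablet.
        System.out.println("writer's locator invalidated? " + writersCopy.invalidated);
        System.out.println("current locator invalidated? " + getLocator("vb").invalidated);
    }
}
```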
This behavior subsumes the circumstances William wrote about in the thread
<https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E>
he mentioned. The problem would occur as a result of either splits or
major compactions. It would only stop the BatchWriter if its entire memory
filled up with writes to the same tablet that was closed as a result of a
majc or split; otherwise it would just slow down the BatchWriter by failing
to write some number of entries with every RPC.
There are a few solutions we can think of.
1. Not have the MutationWriter inside the TabletServerBatchWriter
locally cache TabletLocators. I suspect this was done for performance
reasons, so it's probably not a good solution.
2. Have all the MutationWriters clear their caches at the same time the
static TabletLocator cache clears. I like this one. We could store a
reference to the Map that each MutationWriter has inside a static
synchronized WeakHashMap. The only times the weak map needs to be accessed
are:
1. When a MutationWriter is constructed (from constructing a
TabletServerBatchWriter), add its new local TabletLocator cache to the weak
map.
2. When the static TabletLocator cache is cleared, also clear every
map in the weak map.
3. Another solution is to make the invalidate calls on the local
TabletLocator cache rather than the global static one. If we go this route,
we should double-check the idea to make sure it does not impact the
correctness of any other code that uses the cache. I like the
previous idea better.
The TimeoutTabletLocator does not help when no timeout is set on the
BatchWriter (the default behavior).
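To make the second option concrete, here is a rough sketch of the WeakHashMap registry. The names are hypothetical and stand in for the actual TabletServerBatchWriter internals:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of option 2: register each MutationWriter's local locator cache
// in a static synchronized WeakHashMap, and clear all registered caches
// whenever the static TabletLocator cache is cleared.
public class LocatorCacheRegistry {
    // Weak keys: a MutationWriter's cache drops out of the registry
    // automatically once the writer itself is garbage collected.
    private static final Map<Map<String, Object>, Boolean> REGISTERED =
        Collections.synchronizedMap(new WeakHashMap<>());

    // Called from the (hypothetical) MutationWriter constructor.
    public static void register(Map<String, Object> localCache) {
        REGISTERED.put(localCache, Boolean.TRUE);
    }

    // Called wherever the static TabletLocator cache is cleared (e.g. the
    // hourly clear), so local caches never outlive the static one.
    public static void clearAll() {
        // Must hold the wrapper's lock while iterating a synchronizedMap.
        synchronized (REGISTERED) {
            for (Map<String, Object> cache : REGISTERED.keySet()) {
                cache.clear();
            }
        }
    }
}
```

The weak keys mean the registry never pins a dead BatchWriter's cache in memory, which is the main reason to prefer a WeakHashMap over a plain list here.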
On Tue, Apr 19, 2016 at 8:09 PM, William Slacum <[email protected]> wrote:
Good digs, Dylan. I don't think it's too rare to matter. I notice it often
during MR jobs, and there's usually a point where I give up and just start
writing RFiles.
It could possibly be related to what I saw back in the day with:
https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E
On Tue, Apr 19, 2016 at 6:26 PM, Josh Elser <[email protected]> wrote:
Nice findings. Sorry I haven't had any cycles to dig into this myself.
I look forward to hearing what you find :)
Dylan Hutchison wrote:
I investigated a bit more and I am pretty sure the problem is that the
BatchWriter is not recognizing that the tablet vb<< split into vb;2436<
and vb<;2436. It keeps trying to update the closed tablet vb<<. Each
update writes 0 mutations and records a failure at the tablet server
UpdateSession because vb<< is closed.
I'm not sure why this is happening, because the BatchWriter should have
invalidated its tablet locator cache upon recognizing a failure. Then it
would recognize that the entries it wants to write fall into the new
tablets vb;2436< and vb<;2436. I think there is a timing bug for this edge
case, when a table split occurs during heavy writes.
I will write this up if I can reproduce it. Maybe it is too rare to
matter.
Cheers, Dylan
On Mon, Apr 18, 2016 at 2:38 PM, Dylan Hutchison <[email protected]> wrote:
Hi devs,
I'd like to ask your help in figuring out what is happening to a
BatchWriter. The following gives my reasoning so far.
In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status in
its addMutation method. I saw that it is stuck by jstack'ing the Accumulo
client. It's been stuck like this for 16 hours.
The BatchWriter is supposed to wait when a mutation is added if no
failures have been recorded and either (a) the total memory used exceeds the
maximum allowed for the BatchWriter, or (b) the BatchWriter is currently
flushing. So we conclude that one of (a) or (b) occurred and no
failures were recorded at the time addMutation was called. I think
(a) is likely.
The BatchWriter is supposed to notify itself when (1) a flush
finishes, (2) a constraint violation, authorization failure, server
error, or unknown error occurs, or (3) memory usage decreases, which happens
when entries are successfully sent to the tablet server. Since the BatchWriter
is stuck in WAITING, none of these conditions is occurring.
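For context, the wait/notify behavior described above follows the standard Java guarded-wait pattern. Here is a minimal sketch with hypothetical fields; the real TabletServerBatchWriter tracks more state than this:

```java
// Sketch of the guarded wait assumed above: addMutation() blocks while
// memory is full or a flush is in progress, and every event that could
// release it must notify. If the releasing event never happens (e.g. sends
// keep failing without recording a failure), the caller waits forever.
public class WaitSketch {
    private long memoryUsed = 0;
    private final long maxMemory;
    private boolean flushing = false;
    private boolean failed = false;

    public WaitSketch(long maxMemory) { this.maxMemory = maxMemory; }

    public synchronized void addMutation(long size) throws InterruptedException {
        // Wait while no failure is recorded and (a) memory is over the
        // limit or (b) a flush is in progress.
        while (!failed && (memoryUsed >= maxMemory || flushing)) {
            wait(); // the caller shows as WAITING in jstack here
        }
        memoryUsed += size;
    }

    // Entries successfully sent: memory drops, so wake waiting producers.
    public synchronized void sendSucceeded(long size) {
        memoryUsed -= size;
        notifyAll();
    }

    // A recorded failure also releases the waiter.
    public synchronized void recordFailure() {
        failed = true;
        notifyAll();
    }

    public synchronized long memoryUsed() { return memoryUsed; }
}
```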
The BatchWriter has 3 write threads (the default number). All three have
status TIMED_WAITING (parked) in jstack. Their stack traces don't give
useful information.
Here's what I can tell from the tserver logs. A new table (and tablet)
was created successfully. The BatchWriter started writing to this tablet
steadily. The logs show that the tablet (vb<<) flushed every 5 seconds or
so and major compacted at a steady periodic rate.
Everything looks good until vb<< grew large enough that it needed
splitting. This occurred about 42 minutes after the BatchWriter started
writing entries. The logs show a failure in an UpdateSession that popped
up in the middle of the split operation. This failure continues to show
for the next 15 hours.
I copied the portion of the tserver logs that looks relevant to the split
below. I highlighted the line reporting the first failure. It occurs
between when the split starts and when it finishes.
Any idea what could have caused this? I don't know if the failure is
related to the BatchWriter being stuck in WAITING, but it seems likely. I
think it is weird that the 3 write threads are all idle; at least one of
them should be doing something if the thread calling addMutation() is
waiting.
Here is a pastebin of the jstack <http://pastebin.com/TLfdrDwX>, though I
think I've written out the useful parts from it above.
2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<< closed
2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split vb;2436<
[hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9g.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for high split vb<;2436
[hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9g.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf,
hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf]
2016-04-17 22:38:06,440 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:42302 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1]
2016-04-17 22:38:06,472 [tablet.Tablet] TABLET_HIST: vb<< split vb;2436< vb<;2436
2016-04-17 22:38:06,472 [tablet.Tablet] DEBUG: offline split time : 3.22 secs
*2016-04-17 22:38:06,472 [tserver.TabletServer] INFO : Starting split: vb<<*
2016-04-17 22:38:06,472 [tserver.TabletServer] DEBUG: UpSess 127.0.0.1:41090 0 in 3.059s, at=[0 0 0.00 1] ft=3.010s(pt=3.010s lt=0.000s ct=0.000s)
2016-04-17 22:38:06,472 [tserver.NativeMap] DEBUG: Allocated native map 0x0000000009a20e60
*2016-04-17 22:38:06,473 [tserver.TabletServer] DEBUG: Failures: 1, first extent vb<< successful commits: 0*
2016-04-17 22:38:06,473 [tablet.Tablet] DEBUG: Reloading constraints for extent: vb;2436<
2016-04-17 22:38:06,473 [constraints.ConstraintChecker] DEBUG: Loaded constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint for vb
2016-04-17 22:38:06,474 [tablet.Tablet] TABLET_HIST: vb;2436< opened
2016-04-17 22:38:06,474 [tserver.NativeMap] DEBUG: Allocated native map 0x0000000003edf090
2016-04-17 22:38:06,474 [tablet.Tablet] DEBUG: Reloading constraints for extent: vb<;2436
2016-04-17 22:38:06,474 [constraints.ConstraintChecker] DEBUG: Loaded constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint for vb
2016-04-17 22:38:06,475 [tablet.Tablet] TABLET_HIST: vb<;2436 opened
*2016-04-17 22:38:06,475 [tserver.TabletServer] INFO : Tablet split: vb<< size0 539900144 size1 539900148 time 8268ms*
2016-04-17 22:38:06,594 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
2016-04-17 22:38:06,597 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
2016-04-17 22:38:07,568 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:42307 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
2016-04-17 22:38:07,571 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/c7998e9d-129d-4c61-9732-45bb6bcab3f0/bulk_failed_copyq
2016-04-17 22:38:07,940 [tserver.TabletServer] DEBUG: UpSess 127.0.0.1:41090 0 in 0.067s, at=[1 1 1.00 1] ft=0.000s(pt=0.000s lt=0.000s ct=0.000s)
2016-04-17 22:38:07,941 [tserver.TabletServer] DEBUG: Failures: 1, first extent vb<< successful commits: 0