I investigated a bit more and I am pretty sure the problem is that the BatchWriter is not recognizing that the tablet vb<< split into vb;2436< and vb<;2436. It keeps trying to update the closed tablet vb<<. Each update writes 0 mutations and records a failure at the tablet server UpdateSession because vb<< is closed.
I'm not sure why this is happening because the BatchWriter should have invalidated its tablet locator cache upon recognizing a failure. Then it would recognize that the entries it wants to write fall into the new tablets vb;2436< and vb<;2436. I think there is a timing bug for this edge case, when a table split occurs during heavy writes. I will write this up if I can reproduce it. Maybe it is too rare to matter. Cheers, Dylan On Mon, Apr 18, 2016 at 2:38 PM, Dylan Hutchison <[email protected] > wrote: > Hi devs, > > I'd like to ask your help in figuring out what is happening to a > BatchWriter. The following gives my reasoning so far. > > In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status in > its addMutation method. I saw that it is stuck by jstack'ing the Accumulo > client. It's been stuck like this for 16 hours. > > The BatchWriter is supposed to wait when a mutation is added if no > failures have recorded and either (a) the total memory used exceeds the > maximum allowed for the BatchWriter, or (b) the batchwriter is currently > flushed. So we conclude that one of (a) or (b) have occurred and no > failures were recorded, at the time when addMutation was called. I think > (a) is likely. > > The BatchWriter is supposed to notify itself when either (1) a flush > finishes, (2) a constraint violation or authorization failure or server > error or unknown error occurs, (3) memory usage decreases, which happens > when entries successfully send to the tablet server. Since the BatchWriter > is stuck on WAITING, none of these conditions are occurring. > > The BatchWriter has 3 write threads (the default number). All three have > status TIMED_WAITING (parked) in jstack. Their stack traces don't give > useful information. > > Here's what I can tell from the tserver logs. A new table (and tablet) > was created successfully. The BatchWriter started writing to this tablet > steadily. The logs show that the tablet (vb<<) flushed every 5 seconds or > so and major compacted at a steady periodic rate. > > Everything looks good, until vb<< grew large enough that it needed > splitting. This occurred about 42 minutes after the BatchWriter started > writing entries. The logs show a failure in an UpdateSession that popped > up in the middle of the split operation. This failure continues to show > for the next 15 hours. > > I copied the portion of the tserver logs that look relevant to the split > below. I highlighted the line reporting the first failure. It occurs in > between when the split starts and when it finishes. > > Any idea what could have caused this? I don't know if the failure is > related to the BatchWriter being stuck in WAITING. It seems likely. I > think it is weird that the 3 write threads are all idle; at least one of > them should be doing something if the thread calling addMutation() is > waiting. > > Here is a pastebin of the jstack <http://pastebin.com/TLfdrDwX>, though I > think I wrote the useful parts from it. > > 2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<< closed > 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split > vb;2436< > [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs: > //localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf, > hdfs://localhost:90 > 00/accumulo/tables/vb/default_tablet/M0000a9g.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf] > 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for high split > vb<;2436 > [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00008lh.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C00009iz.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a08.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a4t.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a7m.rf, hdfs > ://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8f.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C0000a8n.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8p.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/F0000a8q.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a93.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000a9b.rf, > hdfs://localhost:9 > 000/accumulo/tables/vb/default_tablet/M0000a9g.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aa7.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000aap.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abe.rf, > hdfs://localhost:9000/accumulo/tables/vb/default_tablet/M0000abn.rf] > 2016-04-17 22:38:06,440 [tserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:42302 !0 0 entries in 0.00 secs, nbTimes = [0 0 0.00 1] > 2016-04-17 22:38:06,472 [tablet.Tablet] TABLET_HIST: vb<< split vb;2436< > vb<;2436 > 2016-04-17 22:38:06,472 [tablet.Tablet] DEBUG: offline split time : 3.22 > secs > *2016-04-17 22:38:06,472 [tserver.TabletServer] INFO : Starting split: > vb<<* > 2016-04-17 22:38:06,472 [tserver.TabletServer] DEBUG: UpSess > 127.0.0.1:41090 0 in 3.059s, at=[0 0 0.00 1] ft=3.010s(pt=3.010s > lt=0.000s ct=0.000s) > 2016-04-17 22:38:06,472 [tserver.NativeMap] DEBUG: Allocated native map > 0x0000000009a20e60 > > *2016-04-17 22:38:06,473 [tserver.TabletServer] DEBUG: Failures: 1, first > extent vb<< successful commits: 0* > > 2016-04-17 22:38:06,473 [tablet.Tablet] DEBUG: Reloading constraints for > extent: vb;2436< > 2016-04-17 22:38:06,473 [constraints.ConstraintChecker] DEBUG: Loaded > constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint > for vb > 2016-04-17 22:38:06,474 [tablet.Tablet] TABLET_HIST: vb;2436< opened > 2016-04-17 22:38:06,474 [tserver.NativeMap] DEBUG: Allocated native map > 0x0000000003edf090 > 2016-04-17 22:38:06,474 [tablet.Tablet] DEBUG: Reloading constraints for > extent: vb<;2436 > 2016-04-17 22:38:06,474 [constraints.ConstraintChecker] DEBUG: Loaded > constraint org.apache.accumulo.core.constraints.DefaultKeySizeConstraint > for vb > 2016-04-17 22:38:06,475 [tablet.Tablet] TABLET_HIST: vb<;2436 opened > *2016-04-17 22:38:06,475 [tserver.TabletServer] INFO : Tablet split: vb<< > size0 539900144 size1 539900148 time 8268ms* > 2016-04-17 22:38:06,594 [tserver.TabletServer] DEBUG: MultiScanSess > 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 > ranges:1) > 2016-04-17 22:38:06,597 [tserver.TabletServer] DEBUG: MultiScanSess > 127.0.0.1:51092 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 > ranges:1) > 2016-04-17 22:38:07,568 [tserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:42307 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 2016-04-17 22:38:07,571 [zookeeper.DistributedWorkQueue] DEBUG: Looking > for work in /accumulo/c7998e9d-129d-4c61-9732-45bb6bcab3f0/bulk_failed_copyq > 2016-04-17 22:38:07,940 [tserver.TabletServer] DEBUG: UpSess > 127.0.0.1:41090 0 in 0.067s, at=[1 1 1.00 1] ft=0.000s(pt=0.000s > lt=0.000s ct=0.000s) > 2016-04-17 22:38:07,941 [tserver.TabletServer] DEBUG: Failures: 1, first > extent vb<< successful commits: 0 > > > > > > > > > > > > >
