ctubbsii opened a new issue #1574: Periodic failure of MetadataMaxFilesIT
URL: https://github.com/apache/accumulo/issues/1574
 
 
   The description of this test failure (involving a split and an unresponsive 
tserver) has elements in common with what @keith-turner described in #1076, 
which also involved a MetadataMaxFilesIT failure. However, I think this is a 
distinct issue, unrelated to Thrift, especially since the problem reported 
there was mitigated by rolling back the change that caused it.
   
   The MetadataMaxFilesIT output shows the test got as far as creating 3 of the 
5 tables it was trying to create, each with 1000 split points, and got stuck 
while adding the split points for the third table:
   
   ```
   2020-03-27T23:40:30,398 [functional.ConfigurableMacBase] INFO : Creating 
table0
   2020-03-27T23:40:30,712 [functional.ConfigurableMacBase] INFO : adding splits
   2020-03-27T23:40:45,004 [functional.ConfigurableMacBase] INFO : flushing
   2020-03-27T23:40:45,103 [functional.ConfigurableMacBase] INFO : Creating 
table1
   2020-03-27T23:40:45,438 [functional.ConfigurableMacBase] INFO : adding splits
   2020-03-27T23:41:01,393 [functional.ConfigurableMacBase] INFO : flushing
   2020-03-27T23:41:01,903 [functional.ConfigurableMacBase] INFO : Creating 
table2
   2020-03-27T23:41:02,309 [functional.ConfigurableMacBase] INFO : adding splits
   2020-03-27T23:43:10,452 [clientImpl.ThriftTransportPool] WARN : Thread 
"addSplits 14" stuck on IO to jenkins:34735 (120000) for at least 120049 ms
   2020-03-27T23:43:10,476 [clientImpl.ThriftTransportPool] INFO : Thread 
"addSplits 14" no longer stuck on IO to jenkins:34735 (120000) sawError = true
   ```
   
   This is where the test timed out.
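
   For reference, the "Creating"/"adding splits"/"flushing" lines above 
correspond to a create/addSplits/flush sequence against the public client API. 
A minimal sketch of that pattern (the client setup, properties path, and split 
values here are illustrative assumptions, not copied from the IT) looks 
roughly like this:

   ```java
   import java.util.SortedSet;
   import java.util.TreeSet;

   import org.apache.accumulo.core.client.Accumulo;
   import org.apache.accumulo.core.client.AccumuloClient;
   import org.apache.hadoop.io.Text;

   public class SplitLoadSketch {
     public static void main(String[] args) throws Exception {
       // Placeholder client properties path; the IT builds its client from the MAC config
       try (AccumuloClient client =
           Accumulo.newClient().from("/path/to/accumulo-client.properties").build()) {
         for (int t = 0; t < 5; t++) {
           String tableName = "table" + t; // matches the "Creating tableN" log lines
           client.tableOperations().create(tableName);

           // 1000 split points per table (illustrative values)
           SortedSet<Text> splits = new TreeSet<>();
           for (int i = 0; i < 1000; i++) {
             splits.add(new Text(String.format("%04d", i)));
           }
           client.tableOperations().addSplits(tableName, splits); // "adding splits"

           // wait=true blocks until the flush completes ("flushing")
           client.tableOperations().flush(tableName, null, null, true);
         }
       }
     }
   }
   ```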
   
   Looking at the tserver logs, it appears that at a certain point a metadata 
tablet split:
   
   ```
   2020-03-27T23:41:10,629 [tablet.location] DEBUG: Split !0;3\\;516;2\\;875 
into !0;3\\;45;2\\;875 and !0;3\\;516;3\\;45 on jenkins:34735[1000a3673370006]
   ```
   
   Not long after, the tserver got into a state where the only thing it was 
doing was complaining that the new metadata tablet from that split had too 
many files. These warnings were in response to repeated attempts from the 
client (presumably the client-side split code from the IT, since the requests 
come from the root user rather than the system user, though I haven't looked 
into how the split code works on the client side):
   
   ```
   2020-03-27T23:41:10,904 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:56900; action: authenticate;
   2020-03-27T23:41:10,904 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has 
too many files, batch lookup can not run
   2020-03-27T23:41:10,916 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:56900; action: authenticate;
   2020-03-27T23:41:10,917 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has 
too many files, batch lookup can not run
   2020-03-27T23:41:10,921 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:56900; action: authenticate;
   2020-03-27T23:41:10,922 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has 
too many files, batch lookup can not run
   2020-03-27T23:41:10,929 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:56900; action: authenticate;
   2020-03-27T23:41:10,930 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has 
too many files, batch lookup can not run
   2020-03-27T23:41:10,942 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:56900; action: authenticate;
   2020-03-27T23:41:10,943 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has 
too many files, batch lookup can not run
   ```
   
   The tserver never recovered from this state, and the test timed out. At 
first there were a few dozen other log lines between each of these messages, 
but at a certain point all other tserver activity settled down, and the last 
60,000 lines or so were just this (plus periodic Java GC stats from 
GarbageCollectionLogger, which didn't show anything unusual).
   
   The only mention of this tablet in any of the other logs is in the Master 
log, where it acknowledges the report of the split:
   
   ```
   2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : tablet 2;031;030 
was loaded on jenkins:34735
   2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : jenkins:34735 
reported split !0;3\\;45;2\\;875, !0;3\\;516;3\\;45
   2020-03-27T23:41:10,783 [master.Master] INFO : Canceled migration of 
TKeyExtent(table:33, endRow:35 32 38, prevEndRow:35 32 34)
   2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : jenkins:34735 
reported split 3;526;524, 3;528;526
   2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : tablet 2;036;035 
was loaded on jenkins:34735
   ```
   
   The Master did its own spamming before the test died, complaining about 4 
offline tablets that were blocking balancing. These tablets never seemed to 
come online:
   
   ```
   2020-03-27T23:43:13,845 [master.Master] DEBUG: [Root Table]: scan time 0.00 
seconds
   2020-03-27T23:43:13,845 [master.Master] DEBUG: [Root Table] sleeping for 
60.00 seconds
   2020-03-27T23:43:13,862 [master.Master] DEBUG: [Metadata Tablets]: scan time 
0.00 seconds
   2020-03-27T23:43:13,862 [master.Master] DEBUG: [Metadata Tablets] sleeping 
for 60.00 seconds
   2020-03-27T23:43:23,508 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:43:23,508 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:43:33,513 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:43:33,513 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:43:43,518 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:43:43,518 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:43:53,522 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:43:53,522 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:44:03,527 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:44:03,527 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:44:13,532 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.01 seconds
   2020-03-27T23:44:13,532 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:44:13,947 [master.Master] DEBUG: [Root Table]: scan time 0.00 
seconds
   2020-03-27T23:44:13,947 [master.Master] DEBUG: [Root Table] sleeping for 
60.00 seconds
   2020-03-27T23:44:13,968 [master.Master] DEBUG: [Metadata Tablets]: scan time 
0.00 seconds
   2020-03-27T23:44:13,969 [master.Master] DEBUG: [Metadata Tablets] sleeping 
for 60.00 seconds
   2020-03-27T23:44:23,536 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:44:23,536 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:44:33,541 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:44:33,541 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:44:43,546 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:44:43,546 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:44:53,550 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:44:53,550 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   2020-03-27T23:45:03,555 [master.Master] DEBUG: Finished gathering 
information from 2 of 2 servers in 0.00 seconds
   2020-03-27T23:45:03,555 [master.Master] DEBUG: not balancing because there 
are unhosted tablets: 4
   ```
   
   The second tserver showed no problematic activity, just repeated scans of 
the root tablet (which it was apparently hosting). The scanned columns look 
like the client's tablet-location lookups; there's a rough sketch of how to 
inspect that data after the excerpt.
   
   ```
   2020-03-27T23:45:12,058 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; 
authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 
false,!0;3;76%00; : [] 9223372036854775807 false), [!0;~%00; : [] 
9223372036854775807 false,!0<%00; : [] 9223372036854775807 false)]}; columns: 
[loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, 
className:org.apache.accumulo.core.iterators.user.WholeRowIterator, 
iterName:WRI)]; iteratorOptions: {WRI={}};
   2020-03-27T23:45:12,060 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: authenticate;
   2020-03-27T23:45:12,061 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: authenticate;
   2020-03-27T23:45:12,061 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; 
authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 
false,!0;3;45%00; : [] 9223372036854775807 false)]}; columns: [loc::, 
~tab:~pr:]; iterators: [IterInfo(priority:10000, 
className:org.apache.accumulo.core.iterators.user.WholeRowIterator, 
iterName:WRI)]; iteratorOptions: {WRI={}};
   2020-03-27T23:45:12,062 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: authenticate;
   2020-03-27T23:45:12,063 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; 
authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 
false,!0;3;76%00; : [] 9223372036854775807 false), [!0;~%00; : [] 
9223372036854775807 false,!0<%00; : [] 9223372036854775807 false)]}; columns: 
[loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, 
className:org.apache.accumulo.core.iterators.user.WholeRowIterator, 
iterName:WRI)]; iteratorOptions: {WRI={}};
   2020-03-27T23:45:12,064 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: authenticate;
   2020-03-27T23:45:12,065 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: authenticate;
   2020-03-27T23:45:12,065 [accumulo.audit] INFO : operation: permitted; user: 
root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; 
authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 
false,!0;3;45%00; : [] 9223372036854775807 false)]}; columns: [loc::, 
~tab:~pr:]; iterators: [IterInfo(priority:10000, 
className:org.apache.accumulo.core.iterators.user.WholeRowIterator, 
iterName:WRI)]; iteratorOptions: {WRI={}};
   ```
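
   Those root-table scans fetch the loc and ~tab:~pr columns via the 
WholeRowIterator, which matches the pattern a client uses when looking up 
tablet locations. For reference only, similar data can be pulled manually with 
something like the following sketch (this is not code the client literally 
runs internally; the client setup and properties path are assumptions):

   ```java
   import java.util.Map.Entry;

   import org.apache.accumulo.core.client.Accumulo;
   import org.apache.accumulo.core.client.AccumuloClient;
   import org.apache.accumulo.core.client.Scanner;
   import org.apache.accumulo.core.data.Key;
   import org.apache.accumulo.core.data.Value;
   import org.apache.accumulo.core.security.Authorizations;
   import org.apache.hadoop.io.Text;

   public class RootTableLocInspect {
     public static void main(String[] args) throws Exception {
       // Placeholder client properties path
       try (AccumuloClient client =
           Accumulo.newClient().from("/path/to/accumulo-client.properties").build()) {
         Scanner scanner = client.createScanner("accumulo.root", Authorizations.EMPTY);
         // Same columns as the audited scans: tablet locations and previous-row markers
         scanner.fetchColumnFamily(new Text("loc"));
         scanner.fetchColumn(new Text("~tab"), new Text("~pr"));
         for (Entry<Key,Value> entry : scanner) {
           System.out.println(entry.getKey() + " -> " + entry.getValue());
         }
         scanner.close();
       }
     }
   }
   ```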
