ctubbsii opened a new issue #1574: Periodic failure of MetadataMaxFilesIT
URL: https://github.com/apache/accumulo/issues/1574

The description of this test failure (involving a split and an unresponsive tserver) has elements in common with what @keith-turner described on #1076, which also had a MetadataMaxFilesIT test failure. However, I think this might be a distinct issue, unrelated to Thrift, especially since the problem reported there was mitigated by rolling back the change that caused it.

The MetadataMaxFilesIT test output showed it got as far as creating 3 of the 5 tables it was trying to create, each with 1000 split points. It was in the middle of adding split points for the third table when it got stuck:

```
2020-03-27T23:40:30,398 [functional.ConfigurableMacBase] INFO : Creating table0
2020-03-27T23:40:30,712 [functional.ConfigurableMacBase] INFO : adding splits
2020-03-27T23:40:45,004 [functional.ConfigurableMacBase] INFO : flushing
2020-03-27T23:40:45,103 [functional.ConfigurableMacBase] INFO : Creating table1
2020-03-27T23:40:45,438 [functional.ConfigurableMacBase] INFO : adding splits
2020-03-27T23:41:01,393 [functional.ConfigurableMacBase] INFO : flushing
2020-03-27T23:41:01,903 [functional.ConfigurableMacBase] INFO : Creating table2
2020-03-27T23:41:02,309 [functional.ConfigurableMacBase] INFO : adding splits
2020-03-27T23:43:10,452 [clientImpl.ThriftTransportPool] WARN : Thread "addSplits 14" stuck on IO to jenkins:34735 (120000) for at least 120049 ms
2020-03-27T23:43:10,476 [clientImpl.ThriftTransportPool] INFO : Thread "addSplits 14" no longer stuck on IO to jenkins:34735 (120000) sawError = true
```

This is where the test timed out.
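The "stuck"/"no longer stuck ... sawError = true" pair in that log suggests a periodic monitor comparing each in-flight IO operation's start time against a fixed threshold. Below is a minimal hedged sketch of such a watchdog, assuming that design; the class and method names here are illustrative, not Accumulo's actual ThriftTransportPool implementation.

```java
// Hypothetical stuck-IO watchdog sketch; names are illustrative, not
// Accumulo's real API.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IoWatchdog {
    // Matches the "(120000)" threshold shown in the log line.
    static final long THRESHOLD_MS = 120_000;

    // thread name -> time (ms) the current IO operation started
    private final Map<String, Long> ioStart = new ConcurrentHashMap<>();

    void beginIo(String thread, long now) { ioStart.put(thread, now); }

    void endIo(String thread) { ioStart.remove(thread); }

    // Called periodically by a monitor thread; emits one warning per thread
    // whose current IO has been in flight longer than the threshold.
    List<String> check(long now) {
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, Long> e : ioStart.entrySet()) {
            long elapsed = now - e.getValue();
            if (elapsed >= THRESHOLD_MS) {
                warnings.add("Thread \"" + e.getKey()
                    + "\" stuck on IO for at least " + elapsed + " ms");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        IoWatchdog w = new IoWatchdog();
        w.beginIo("addSplits 14", 0);
        System.out.println(w.check(120_049)); // past the threshold: one warning
        w.endIo("addSplits 14");
        System.out.println(w.check(240_000)); // IO finished: no warnings
    }
}
```

Under this model, the WARN fires while the addSplits RPC is hung on the wedged tserver, and the follow-up INFO with `sawError = true` fires once the operation finally errors out, which matches the 24 ms gap between the two log lines.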
I looked at the tserver logs, and it appears that at a certain point, a metadata tablet split:

```
2020-03-27T23:41:10,629 [tablet.location] DEBUG: Split !0;3\\;516;2\\;875 into !0;3\\;45;2\\;875 and !0;3\\;516;3\\;45 on jenkins:34735[1000a3673370006]
```

Not long after, the tserver got into a state where the only thing it was doing was complaining that the new metadata tablet from that split had too many files, in response to repeated attempts from the client (presumably the client-side split code from the IT, since the requests come from the root user rather than the system user, though I haven't checked how the split code works on the client side):

```
2020-03-27T23:41:10,904 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,904 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,916 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,917 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,921 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,922 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,929 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,930 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
2020-03-27T23:41:10,942 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:56900; action: authenticate;
2020-03-27T23:41:10,943 [tablet.Tablet] WARN : Tablet !0;3\\;45;2\\;875 has too many files, batch lookup can not run
```

The tserver never recovered from this state, and the test timed out.
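The shape of that failure can be modeled as a simple livelock: a freshly split tablet carries more files than a batch lookup is allowed to open, the server refuses every lookup, and the client immediately retries, so nothing makes progress unless something (e.g. a compaction) reduces the file count. The sketch below is a hypothetical model of that loop, not Accumulo's actual Tablet API; `MAX_FILES` stands in for the real configurable limit.

```java
// Hypothetical model of the "too many files" retry livelock; names are
// illustrative, not Accumulo's actual implementation.
import java.util.ArrayList;
import java.util.List;

public class TooManyFilesLoop {
    // Stand-in for the per-scan max-files limit (a table property in reality).
    static final int MAX_FILES = 15;

    static class Tablet {
        final int numFiles;
        final List<String> warnings = new ArrayList<>();
        Tablet(int numFiles) { this.numFiles = numFiles; }

        // Mirrors the guard implied by the repeated WARN message: refuse the
        // batch lookup while the tablet has too many files.
        boolean batchLookup() {
            if (numFiles >= MAX_FILES) {
                warnings.add("Tablet has too many files, batch lookup can not run");
                return false; // the client treats this as retryable
            }
            return true;
        }
    }

    // Retry until the lookup succeeds or we hit a cap; returns failed attempts.
    // Nothing here ever reduces numFiles (no compaction runs), so a wedged
    // tablet spins until the cap -- or forever, without one.
    static int failedAttempts(Tablet t, int cap) {
        int attempts = 0;
        while (!t.batchLookup() && ++attempts < cap) { }
        return attempts;
    }

    public static void main(String[] args) {
        Tablet stuck = new Tablet(40); // split child carrying the parent's files
        System.out.println(failedAttempts(stuck, 5) + " failed attempts");
    }
}
```

Under this model, the open question in the real failure is why no compaction ever ran on the split child to bring its file count back under the limit.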
There were an average of a few dozen log lines between each of these messages at first, but at a certain point, all other tserver activity settled, and the last 60,000 lines or so were just this (plus periodic Java GC stats from GarbageCollectionLogger, which didn't show anything unusual). The only mention of this tablet in any of the other logs is in the Master logs, where it acknowledges the report of the split:

```
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : tablet 2;031;030 was loaded on jenkins:34735
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : jenkins:34735 reported split !0;3\\;45;2\\;875, !0;3\\;516;3\\;45
2020-03-27T23:41:10,783 [master.Master] INFO : Canceled migration of TKeyExtent(table:33, endRow:35 32 38, prevEndRow:35 32 34)
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : jenkins:34735 reported split 3;526;524, 3;528;526
2020-03-27T23:41:10,783 [master.EventCoordinator] INFO : tablet 2;036;035 was loaded on jenkins:34735
```

The Master actually did its own spamming before the test died, complaining about 4 offline tablets that were blocking balancing.
These tablets did not seem to ever come online:

```
2020-03-27T23:43:13,845 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
2020-03-27T23:43:13,845 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
2020-03-27T23:43:13,862 [master.Master] DEBUG: [Metadata Tablets]: scan time 0.00 seconds
2020-03-27T23:43:13,862 [master.Master] DEBUG: [Metadata Tablets] sleeping for 60.00 seconds
2020-03-27T23:43:23,508 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:23,508 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:43:33,513 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:33,513 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:43:43,518 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:43,518 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:43:53,522 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:43:53,522 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:03,527 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:03,527 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:13,532 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.01 seconds
2020-03-27T23:44:13,532 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:13,947 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
2020-03-27T23:44:13,947 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
2020-03-27T23:44:13,968 [master.Master] DEBUG: [Metadata Tablets]: scan time 0.00 seconds
2020-03-27T23:44:13,969 [master.Master] DEBUG: [Metadata Tablets] sleeping for 60.00 seconds
2020-03-27T23:44:23,536 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:23,536 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:33,541 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:33,541 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:43,546 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:43,546 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:44:53,550 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:44:53,550 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
2020-03-27T23:45:03,555 [master.Master] DEBUG: Finished gathering information from 2 of 2 servers in 0.00 seconds
2020-03-27T23:45:03,555 [master.Master] DEBUG: not balancing because there are unhosted tablets: 4
```

The second tserver had no problematic activity. It showed repeated scans of the root tablet (which it was apparently hosting):
```
2020-03-27T23:45:12,058 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;76%00; : [] 9223372036854775807 false), [!0;~%00; : [] 9223372036854775807 false,!0<%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
2020-03-27T23:45:12,060 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,061 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,061 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;45%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
2020-03-27T23:45:12,062 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,063 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;76%00; : [] 9223372036854775807 false), [!0;~%00; : [] 9223372036854775807 false,!0<%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
2020-03-27T23:45:12,064 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,065 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: authenticate;
2020-03-27T23:45:12,065 [accumulo.audit] INFO : operation: permitted; user: root; client: 172.31.8.253:57824; action: scan; targetTable: accumulo.root; authorizations: ; range: {+r<<=[[!0;2;875%00; : [] 9223372036854775807 false,!0;3;45%00; : [] 9223372036854775807 false)]}; columns: [loc::, ~tab:~pr:]; iterators: [IterInfo(priority:10000, className:org.apache.accumulo.core.iterators.user.WholeRowIterator, iterName:WRI)]; iteratorOptions: {WRI={}};
```
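Those repeated root-table scans look like a client-side tablet locator that cannot resolve a metadata tablet's location and falls back to re-scanning the root table on every lookup. The following is a hedged, simplified model of that retry pattern under that assumption; the class and method names are illustrative, not Accumulo's actual TabletLocator API.

```java
// Hypothetical model of the locator fallback loop; names are illustrative.
import java.util.HashMap;
import java.util.Map;

public class LocatorLoop {
    private final Map<String, String> cache = new HashMap<>();
    int rootScans = 0; // counts fallback scans of the root table

    // rootTable maps metadata extent -> current location (absent if unhosted)
    String locate(String extent, Map<String, String> rootTable) {
        String loc = cache.get(extent);
        if (loc != null) return loc; // cache hit: no root scan needed
        rootScans++; // cache miss: scan the root table for the location
        loc = rootTable.get(extent);
        if (loc != null) cache.put(extent, loc);
        return loc; // null -> caller retries, causing another root scan
    }

    public static void main(String[] args) {
        LocatorLoop locator = new LocatorLoop();
        Map<String, String> root = new HashMap<>();
        // The stuck split child has no usable location, so every retry
        // produces another scan of the root table, as seen in the audit log.
        for (int i = 0; i < 3; i++) {
            locator.locate("!0;3;45", root);
        }
        System.out.println(locator.rootScans + " root scans");
    }
}
```

If that model is right, the second tserver's scan spam is just collateral from the client endlessly retrying against the wedged metadata tablet, not an independent problem.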
----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
