GeorgeJahad opened a new pull request #2703: URL: https://github.com/apache/ozone/pull/2703
## What changes were proposed in this pull request? When the OM-HA code was added, the tests were not completely updated. Many of them still call the OzoneManagerProtocol "write-path" methods even though those methods have been superseded by the HA methods. For this reason the obsolete protocol methods in the OzoneManager class could not be removed. This PR replaces all those obsolete methods with default methods in OzoneManagerProtocol.java that just throw an UnsupportedOperationException. It also fixes the tests broken by those changes. ## Unremoved methods The following methods are in the OzoneManager class, on the write path. They were not removed because they are invoked from with the OmRequest subclasses: ``` OzoneManager::finalizeUpgrade OzoneManager::getDelegationToken OzoneManager::renewDelegationToken OzoneManager::cancelDelegationToken ``` ## Other dead code removed In my testing, I noticed a couple of other methods that are no longer used and removed them as well: OzoneManager::checkVolumeAccess is unsupported in the [VolumeManagerImpl](https://github.com/apache/ozone/blob/fcd016021452d14371e3335982ee0f7cb72fac5c/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/VolumeManagerImpl.java#L386) [OzoneManagerRequestHandler::allocateBlock](https://github.com/apache/ozone/blob/e5c647eb65b896d629f8e20e066cccae32834d80/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerRequestHandler.java#L458) is a private method that is not called from anywhere ## Complex Fixes Most of the broken tests were easy to fix. Basically any the tests that invoked the removed methods directly from the OzoneManager.class were modified to invoke them from an OzoneManager client. There were a couple of difficult areas: ### TestOmMetrics class These tests verify that the correct number of metrics are being generated. Most of them use mocks invoked by the methods removed in this PR. So they had to be significantly restructured. In most cases the mocks had to be replaced with a miniCluster, (because there is no other way to generate the metrics.) My goal was to fix the tests so the number of metrics would not change. I was able to do that in all cases except one. One of the tests uses the checkVolumeAccess() method, which was removed months ago, (as mentioned above). The old mocks were incorrectly ignoring the exception it has been generating. So I removed it from the test and corrected the corresponding counts. The rest of the counts are unchanged. ### listStatus() race condition while key is in cache Under certain circumstances, listStatus() fails to return a key that is in the cache, (but not in the keyTable). This race did not occur prior to this PR because the test was using the pre-HA version which doesn't use the cache. Since it works when the key is in the table, I fixed the test by retrying until it is read from the table, but it is possible that listStatus() itself should be fixed. The details of the race are a bit complicated, but explained below. #### listStatus() race condition details With the commitKey OmRequest, the key is first written to the keyTableCache and then to the keyTable when the double buffer is flushed. If the OZONE_OM_ENABLE_FILESYSTEM_PATHS is enabled, and the key looks like a file name, the key's superdirectory's are added to cache during the openKey [OmRequest](https://github.com/apache/ozone/blob/af5b48e4571e74fcad2e4546ff0efe79c3a96350/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyCreateRequest.java#L270) If OZONE_OM_ENABLE_FILESYSTEM_PATHS is not enabled, the superdirectories are not added in the openkey request. The cache search method in listStatus() is written to skip any keys with an embedded "/" [here](https://github.com/apache/ozone/blob/af5b48e4571e74fcad2e4546ff0efe79c3a96350/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L2285) So, if OZONE_OM_ENABLE_FILESYSTEM_PATHS is disabled, and a key with an embedded "/" is commited, and it is still in the cache, listStatus() won't see it. listStatus() does return the key when it gets flushed to the table, so sleeping for a bit fixes the problem. I've only noticed this happening on one test and only 20% of the time. (I've put a sleep in to handle that case.) But we should discuss if that is good enough. ## Genesis Broken in Master `ozone genesis -b BenchMarkOzoneManager` uses the pre-HA code and so should also be updated for these changes. It is currently broken in the master branch. It generates this exception when I try to run it in master: ``` org.apache.hadoop.metrics2.MetricsException: Metrics source DBCheckpointMetrics already exists! ``` I think that error will need to be fixed before I can update it, so I've left it out of this PR. Let me know if that's a problem. Since the CI tests don't run Genesis, the lack of updates don't cause the tests to fail. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-3371 ## How was this patch tested? All effected tests were updated and all pass without fail. I timed all the changed tests. Before the changes, they took 12 minutes on average. Now they take 12.5 minutes on average. I ran the CI 12 times each on master and on this branch. master failed 6 of those times, this branch failed 6 of those times. While there was some overlap in the failures, some were unique to master and some to this branch. As best I could tell, none of the failures were related to these changes, but it is hard to be sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
