ccoffline opened a new issue #6386: URL: https://github.com/apache/incubator-doris/issues/6386
Commit #3775, which introduced table locks, has several critical bugs that crash every FE. All of them are NPEs thrown while replaying editlogs; see #5378, #5391, #5688, #5973, and #6155.

Ideally, replaying editlogs should never throw an NPE, because the original operations already checked their preconditions and ran in the correct order. However, #3775 did not keep edit operations synchronized at the DB level, so the order of editlogs can be disturbed. This happens when an edit operation on a table runs concurrently with another operation on the same table that holds the DB write lock, such as drop/replace/rename.

When disordered editlogs are replayed, the meta may become inconsistent between MASTER and FOLLOWER. For instance, if an editlog for a table arrives after the editlog that drops that table, the current approach is simply to ignore it. But if the user recovers the table afterwards, the edits made to the dropped table object are recovered on MASTER but lost on FOLLOWER. It gets more complicated with replace/rename.

It is necessary to make sure the meta is 100% consistent. However, the replay bugs make the cluster completely unavailable, so the most urgent task is to avoid any NPE during replay.

When designing the fix, we took the following factors into primary consideration:

* `Catalog.getDb` and `Database.getTable` should either return an Optional, or return a non-null value and throw an exception when the object is missing. The caller can then call `Optional.get` or `Optional.orElse(null)` directly after considering the null case, and reviewers can easily spot potential NPEs.
* In the replay routine, `Catalog.getDb` and `Database.getTable` can throw `MetaNotFoundException`, which is caught by `Editlog.loadJournal`. This indicates that the meta may be inconsistent, so we need to log a warning for tracking.
* This fix should focus only on avoiding replay NPEs and stay consistent with the original processing logic.
  Mark any potential bugs where possible so they can be discussed later, such as concurrently dropping a database, which hardly ever happens in practice.

After this, there is still a lot of work to do. We need to fix many inconsistent lock routines and develop concurrent tests on meta.
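
As a rough illustration of the first two design points above (Optional-returning getters, and catching `MetaNotFoundException` in the journal loader), here is a minimal, self-contained Java sketch. It is not the actual Doris code: the `ReplaySketch` class, the simplified `Database` and `Table` types, the `getTableOrMetaException` helper, and the `replayBumpVersion` operation are all hypothetical names invented for this example, and `MetaNotFoundException` is redefined locally as a stand-in.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class ReplaySketch {
    // Local stand-in for the real MetaNotFoundException (assumption).
    static class MetaNotFoundException extends Exception {
        MetaNotFoundException(String msg) { super(msg); }
    }

    // Hypothetical, heavily simplified table meta.
    static class Table {
        final String name;
        int version = 0;
        Table(String name) { this.name = name; }
    }

    // Hypothetical, heavily simplified database meta.
    static class Database {
        private final Map<Long, Table> tables = new HashMap<>();

        void addTable(long id, Table t) { tables.put(id, t); }
        void dropTable(long id) { tables.remove(id); }

        // Variant 1: return an Optional so the caller must handle absence.
        Optional<Table> getTable(long id) {
            return Optional.ofNullable(tables.get(id));
        }

        // Variant 2: return a non-null value or throw a checked exception.
        Table getTableOrMetaException(long id) throws MetaNotFoundException {
            return getTable(id).orElseThrow(
                    () -> new MetaNotFoundException("unknown table, id=" + id));
        }
    }

    // A replay method never dereferences a possibly-null table.
    static void replayBumpVersion(Database db, long tableId)
            throws MetaNotFoundException {
        Table t = db.getTableOrMetaException(tableId);
        t.version++;
    }

    // Stand-in for the journal loader: a missing object becomes a logged
    // warning instead of an NPE. Returns true if the entry was applied.
    static boolean loadJournal(Database db, long tableId) {
        try {
            replayBumpVersion(db, tableId);
            return true;
        } catch (MetaNotFoundException e) {
            System.err.println("WARN: replay skipped, meta may be inconsistent: "
                    + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        Database db = new Database();
        db.addTable(1L, new Table("t1"));
        loadJournal(db, 1L);   // table exists: the edit is applied
        db.dropTable(1L);
        loadJournal(db, 1L);   // editlog arrives after the drop: warn and skip
    }
}
```

With this shape, an editlog that arrives after the drop of its table produces a warning and is skipped rather than crashing the FE. It does not solve the MASTER/FOLLOWER inconsistency described above; it only makes the inconsistency visible in the logs.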
