Yingyi Bu has posted comments on this change. Change subject: [ASTERIXDB-1943][API][STO] Make rebalance idempotent. ......................................................................
Patch Set 14: (14 comments) https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/RebalanceApiServlet.java File asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/RebalanceApiServlet.java: PS12, Line 74: Future > parametrize Future throughout this class? Couldn't understand what exactly you mean? PS12, Line 77: rebalanceFutureTerminated > I don't understand the need for this queue and the use of the countdownlatc Basically if the Future.get() is called after Future.cancel(true), you get CancellationException immediately and cannot track if the Future is done or not. Here, I want to make sure that post(..) doesn't return to the user until the rebalance thread is all done. See the selected answer here: https://stackoverflow.com/questions/21068547/how-to-interrupt-a-future-but-still-wait-for-it-to-finish PS12, Line 174: : getAllDatasetsForRebalance(dataverseName); : for (Dataset dataset : datasets) { : // > Log bad requests? Done PS12, Line 159: lanceTasks.add(task); : rebalanceFutureTerminated.add(terminated); : return terminated; : } : : // Performs the actual rebalance. : private void doRebalance(String dataverseName, String datasetName, String[] targetNodes, IServletResponse response, : CountDownLatch terminated) { : try { : // Sets the content type. : HttpUtil.setContentType(response, HttpUtil.ContentType.APPLICATION_JSON, HttpUtil.Encoding.UTF8); : : if (datasetName == null) { : // Rebalances datasets in a given dataverse or all non-metadata datasets. : List<Dataset> datasets = dataverseName == null ? getAllDatasetsForRebalance() : : getAllDatasetsForRebalance(dataverseName); : for (Dataset dataset : datasets) { : // By the time rebalanceDataset(...) is called, the dataset could have been dropped. : // If that's the case, rebalanceDataset(...) would be a no-op. : rebalanceDataset(dataset.getDataverseName(), dataset.getDatasetName(), targetNodes); : } : } else { : // Rebalances a given dataset from its current locations to the target nodes. : rebalanceDataset(dataverseName, datasetName, targetNodes); : } : > why not validate before submitting the task? Done https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/main/java/org/apache/asterix/utils/RebalanceUtil.java File asterixdb/asterix-app/src/main/java/org/apache/asterix/utils/RebalanceUtil.java: PS12, Line 132: / Up to this point, since the bulk part of a rebalance operation is done, : // the following two operations will retry after interrupt and finally rethrow InterruptedException, : // which means that they will always succeed and could possibly throw InterruptedException as the last step. : // TODO(yingyi): ASTERIXDB-1948, in case a crash happens, currently the system will either: > Create an issue? The issue is ASTERIXDB-1948. I've updated the comments too. PS12, Line 148: } > why introduce a new interface? Tired Callable<Void>, it in cascade requires the involved methods to return null for the return type Void. The interface is private, so nobody else can see/use it. PS12, Line 167: } > Log too when that happens. maybe along with the number of interrupted attem RebalanceWithCancellationIT can cover it, sporadically. Without the retry loop, sometimes I see that the node group name returned from test queries have the name with suffix as a UUID, which means the nodegroup for the source was leaked due to interruption. For instance, see asterixdb/asterix-app/src/test/resources/runtimets/queries_sqlpp/rebalance/single_dataset/single_dataset.11.query.sqlpp https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/test/java/org/apache/asterix/test/common/RebalanceCancellationTestExecutor.java File asterixdb/asterix-app/src/test/java/org/apache/asterix/test/common/RebalanceCancellationTestExecutor.java: PS12, Line 96: rc == 200 || rc == 404 > the coverage of this test is weak and could miss regressions. Create tests Regressions? >> the coverage of this test is weak The coverage is lifted in RebalanceWithCancellationIT -- it executes each single tests 20 times and interrupts at different time points. It's a new test RebalanceWithCancellationIT. >> Also cover the test case for the cancellation during the dataset switch! This seems challenging... I don't have a good idea as it involves MetadataManager/MetadataNode RPC and we need to make sure the RPC gets interrupted. It don't think it will go too bad if we leave the coverage for that blank: (1) it's a very quick operation, e.g., 10ms (2) rebalanceSwitch() is wrapped within a metadata txn, if anything fails, it aborts the metadata txn, which cancels the dataset metadata entity update. (3) if it's crashed, either the dataset is switched or not, we can still continue as discussed in: https://cwiki.apache.org/confluence/display/ASTERIXDB/Rebalance+API+and+Internal+Implementation https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-common/src/main/java/org/apache/asterix/common/context/DatasetLifecycleManager.java File asterixdb/asterix-common/src/main/java/org/apache/asterix/common/context/DatasetLifecycleManager.java: PS12, Line 274: > this error message should be fixed. preferably with an error code! Done PS12, Line 280: } : } finally { : // Regardless of what exception is thrown in the try-block (e.g., line 279), : // we have to un-touch the index and dataset. : if (iInfo != null) { : iInfo.untouch(); : } > Can we add a comment explaining in what scenario can each of these two case Done https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java File asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java: PS12, Line 219: resourceCache.invalidate(relativePath); : resourceFile.delete(); : } finally { : // Regardless of successfully deleted o > I think this whole method should be atomic. an easy way to do this is to ma >>I think this whole method should be atomic. >>an easy way to do this is to make it un-interruptible. I did some experiments, it turns out that File.delete() cannot be interrupted. >>note that as per the comment, >>this could lead to a resource being deleted >>locally but not on replica since the code below will not be executed! Done. https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/api/IIndexDataflowHelper.java File hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/api/IIndexDataflowHelper.java: PS12, Line 49: e index instance > I think this interface shouldn't change and we can keep the failSilently in Done https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDataflowHelper.java File hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDataflowHelper.java: PS12, Line 61: (lr.getPath( > keeping the flag outside this, we can easily avoid this NPE too. Done https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDropOperatorNodePushable.java File hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDropOperatorNodePushable.java: PS12, Line 55: > instead of passing this in. let this method swallow the exception? Done -- To view, visit https://asterix-gerrit.ics.uci.edu/1821 To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d14a07978e106cd497cc35538fafef318b2fcf7 Gerrit-PatchSet: 14 Gerrit-Project: asterixdb Gerrit-Branch: master Gerrit-Owner: Yingyi Bu <[email protected]> Gerrit-Reviewer: Jenkins <[email protected]> Gerrit-Reviewer: Michael Blow <[email protected]> Gerrit-Reviewer: Till Westmann <[email protected]> Gerrit-Reviewer: Yingyi Bu <[email protected]> Gerrit-Reviewer: abdullah alamoudi <[email protected]> Gerrit-HasComments: Yes
