Yingyi Bu has posted comments on this change.

Change subject: [ASTERIXDB-1943][API][STO] Make rebalance idempotent.
......................................................................


Patch Set 14:

(14 comments)

https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/RebalanceApiServlet.java
File 
asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/RebalanceApiServlet.java:

PS12, Line 74: Future
> parametrize Future throughout this class?
Couldn't understand what exactly you mean?


PS12, Line 77: rebalanceFutureTerminated
> I don't understand the need for this queue and the use of the countdownlatc
Basically if the Future.get() is called after Future.cancel(true), you get 
CancellationException immediately and cannot track if the Future is done or 
not.  Here, I want to make sure that post(..) doesn't return to the user until 
the rebalance thread is all done.

See the selected answer here: 

https://stackoverflow.com/questions/21068547/how-to-interrupt-a-future-but-still-wait-for-it-to-finish


PS12, Line 174:          : getAllDatasetsForRebalance(dataverseName);
              :                 for (Dataset dataset : datasets) {
              :                     // 
> Log bad requests?
Done


PS12, Line 159: lanceTasks.add(task);
              :         rebalanceFutureTerminated.add(terminated);
              :         return terminated;
              :     }
              : 
              :     // Performs the actual rebalance.
              :     private void doRebalance(String dataverseName, String 
datasetName, String[] targetNodes, IServletResponse response,
              :             CountDownLatch terminated) {
              :         try {
              :             // Sets the content type.
              :             HttpUtil.setContentType(response, 
HttpUtil.ContentType.APPLICATION_JSON, HttpUtil.Encoding.UTF8);
              : 
              :             if (datasetName == null) {
              :                 // Rebalances datasets in a given dataverse or 
all non-metadata datasets.
              :                 List<Dataset> datasets = dataverseName == null 
? getAllDatasetsForRebalance()
              :                         : 
getAllDatasetsForRebalance(dataverseName);
              :                 for (Dataset dataset : datasets) {
              :                     // By the time rebalanceDataset(...) is 
called, the dataset could have been dropped.
              :                     // If that's the case, 
rebalanceDataset(...) would be a no-op.
              :                     
rebalanceDataset(dataset.getDataverseName(), dataset.getDatasetName(), 
targetNodes);
              :                 }
              :             } else {
              :                 // Rebalances a given dataset from its current 
locations to the target nodes.
              :                 rebalanceDataset(dataverseName, datasetName, 
targetNodes);
              :             }
              : 
> why not validate before submitting the task?
Done


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/main/java/org/apache/asterix/utils/RebalanceUtil.java
File 
asterixdb/asterix-app/src/main/java/org/apache/asterix/utils/RebalanceUtil.java:

PS12, Line 132: / Up to this point, since the bulk part of a rebalance 
operation is done,
              :         // the following two operations will retry after 
interrupt and finally rethrow InterruptedException,
              :         // which means that they will always succeed and could 
possibly throw InterruptedException as the last step.
              :         // TODO(yingyi): ASTERIXDB-1948, in case a crash 
happens, currently the system will either:
> Create an issue?
The issue is  ASTERIXDB-1948.  I've updated the comments too.


PS12, Line 148: }
> why introduce a new interface?
Tired Callable<Void>, it in cascade requires the involved methods to return 
null for the return type Void.

The interface is private, so nobody else can see/use it.


PS12, Line 167: }
> Log too when that happens. maybe along with the number of interrupted attem
RebalanceWithCancellationIT can cover it, sporadically.  Without the retry 
loop, sometimes I see that the node group name returned from test queries have 
the name with suffix as a UUID, which means the nodegroup for the source was 
leaked due to interruption.

For instance, see 
asterixdb/asterix-app/src/test/resources/runtimets/queries_sqlpp/rebalance/single_dataset/single_dataset.11.query.sqlpp


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/test/java/org/apache/asterix/test/common/RebalanceCancellationTestExecutor.java
File 
asterixdb/asterix-app/src/test/java/org/apache/asterix/test/common/RebalanceCancellationTestExecutor.java:

PS12, Line 96: rc == 200 || rc == 404
> the coverage of this test is weak and could miss regressions. Create tests 
Regressions?  


>> the coverage of this test is weak


The coverage is lifted in RebalanceWithCancellationIT -- it executes each 
single tests 20 times and interrupts at different time points.


It's a new test RebalanceWithCancellationIT. 


>> Also cover the test case for the cancellation during the dataset switch!
This seems challenging...  I don't have a good idea as it involves 
MetadataManager/MetadataNode RPC and we need to make sure the RPC gets 
interrupted.


It don't think it will go too bad if we leave the coverage for that blank:

(1) it's a very quick operation, e.g., 10ms

(2) rebalanceSwitch() is wrapped within a metadata txn, if anything fails, it 
aborts the metadata txn, which cancels the dataset metadata entity update.

(3) if it's crashed, either the dataset is switched or not, we can still 
continue as discussed in: 
https://cwiki.apache.org/confluence/display/ASTERIXDB/Rebalance+API+and+Internal+Implementation


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-common/src/main/java/org/apache/asterix/common/context/DatasetLifecycleManager.java
File 
asterixdb/asterix-common/src/main/java/org/apache/asterix/common/context/DatasetLifecycleManager.java:

PS12, Line 274: 
> this error message should be fixed. preferably with an error code!
Done


PS12, Line 280:   }
              :         } finally {
              :             // Regardless of what exception is thrown in the 
try-block (e.g., line 279),
              :             // we have to un-touch the index and dataset.
              :             if (iInfo != null) {
              :                 iInfo.untouch();
              :             }
> Can we add a comment explaining in what scenario can each of these two case
Done


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java
File 
asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java:

PS12, Line 219:    resourceCache.invalidate(relativePath);
              :                 resourceFile.delete();
              :             } finally {
              :                 // Regardless of successfully deleted o
> I think this whole method should be atomic. an easy way to do this is to ma
>>I think this whole method should be atomic. 
>>an easy way to do this is to make it un-interruptible. 


I did some experiments, it turns out that File.delete() cannot be interrupted.


>>note that as per the comment, 
>>this could lead to a resource being deleted 
>>locally but not on replica since the code below will not be executed!


Done.


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/api/IIndexDataflowHelper.java
File 
hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/api/IIndexDataflowHelper.java:

PS12, Line 49: e index instance
> I think this interface shouldn't change and we can keep the failSilently in
Done


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDataflowHelper.java
File 
hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDataflowHelper.java:

PS12, Line 61: (lr.getPath(
> keeping the flag outside this, we can easily avoid this NPE too.
Done


https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDropOperatorNodePushable.java
File 
hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDropOperatorNodePushable.java:

PS12, Line 55: 
> instead of passing this in. let this method swallow the exception?
Done


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/1821
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d14a07978e106cd497cc35538fafef318b2fcf7
Gerrit-PatchSet: 14
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Yingyi Bu <[email protected]>
Gerrit-Reviewer: Jenkins <[email protected]>
Gerrit-Reviewer: Michael Blow <[email protected]>
Gerrit-Reviewer: Till Westmann <[email protected]>
Gerrit-Reviewer: Yingyi Bu <[email protected]>
Gerrit-Reviewer: abdullah alamoudi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to