Murtadha Hubail has posted comments on this change.

Change subject: Asterix NCs Fault Tolerance
......................................................................


Patch Set 3:

(47 comments)

https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-common/src/main/java/org/apache/asterix/common/config/AsterixReplicationProperties.java
File 
asterix-common/src/main/java/org/apache/asterix/common/config/AsterixReplicationProperties.java:

Line 35:     //we could centralized then a single configuration file.
> I think that it would be better to put this into a JIRA issue (ideally with
Removed from here and will open a JIRA about it.


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-common/src/main/java/org/apache/asterix/common/transactions/IRecoveryManager.java
File 
asterix-common/src/main/java/org/apache/asterix/common/transactions/IRecoveryManager.java:

Line 116:      * Reply the logs that belong to the passed {@code partitions} 
starting from the {@code lowWaterMarkLSN}
> s/Reply/Replay/
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/clusterts/cluster_with_replication.xml
File 
asterix-installer/src/test/resources/clusterts/cluster_with_replication.xml:

Line 19:     <cluster xmlns="cluster">
> Fix indentation?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.1.ddl.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.1.ddl.aql:

Line 22:  * start 2 nodes, bulkload a dataset, query data, kill one node and 
wait until the failover complete, query the data again.
> line break?
Done


Line 28:     create dataverse TinySocial;
> Indentation?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.2.update.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.2.update.aql:

Line 22:  * start 2 nodes, bulkload a dataset, query data, kill one node and 
wait until the failover complete, query the data again.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.3.txnqbc.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.3.txnqbc.aql:

Line 22:  * start 2 nodes, bulkload a dataset, query data, kill one node and 
wait until the failover complete, query the data again.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.6.txnqar.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/bulkload/bulkload.6.txnqar.aql:

Line 22:  * start 2 nodes, bulkload a dataset, query data, kill one node and 
wait until the failover complete, query the data again.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.1.ddl.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.1.ddl.aql:

Line 22:  * start 2 nodes, bulkload a dataset, copy it to in-memory dataset, 
query data from memory, kill one node and wait until the failover complete, 
query the data again.
> line break?
Done


Line 28:     create dataverse TinySocial;
> Indentation?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.2.update.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.2.update.aql:

Line 22:  * start 2 nodes, bulkload a dataset, copy it to in-memory dataset, 
query data from memory, kill one node and wait until the failover complete, 
query the data again.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.3.txnqbc.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.3.txnqbc.aql:

Line 22:  * start 2 nodes, bulkload a dataset, copy it to in-memory dataset, 
query data from memory, kill one node and wait until the failover complete, 
query the data again.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.6.txnqar.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/mem_component_recovery/mem_component_recovery.6.txnqar.aql:

Line 22:  * start 2 nodes, bulkload a dataset, copy it to in-memory dataset, 
query data from memory, kill one node and wait until the failover complete, 
query the data again.
> line break
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/metadata_node/metadata_node.1.ddl.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/metadata_node/metadata_node.1.ddl.aql:

Line 22:  * start 2 nodes, create a dataset, kill metadata node and wait until 
the failover complete, verify the dataset still exists.
> line break?
Done


Line 28:     create dataverse TinySocial;
> Indentation?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/metadata_node/metadata_node.2.txnqbc.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/metadata_node/metadata_node.2.txnqbc.aql:

Line 22:  * start 2 nodes, create a dataset, kill metadata node and wait until 
the failover complete, verify the dataset still exists.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-installer/src/test/resources/integrationts/replication/queries/failover/metadata_node/metadata_node.5.txnqar.aql
File 
asterix-installer/src/test/resources/integrationts/replication/queries/failover/metadata_node/metadata_node.5.txnqar.aql:

Line 22:  * start 2 nodes, create a dataset, kill metadata node and wait until 
the failover complete, verify the dataset still exists.
> line break?
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-om/src/main/java/org/apache/asterix/om/util/AsterixClusterProperties.java
File 
asterix-om/src/main/java/org/apache/asterix/om/util/AsterixClusterProperties.java:

Line 301:         AsterixReplicationProperties repliactionProperties = 
AsterixAppContextInfo.getInstance()
> a/repliactionProperties/replicationProperties/
Done


Line 313:                 if (ncConfiguration.containsKey(replica)) {
> What does it mean, if we got the replica from the replication properties, b
NC configuration contains the current active NCs. Replication properties 
contains the static information of where each node replicas are. So, if any of 
these replica is not active, it is ignored.


Line 337:                 e.printStackTrace();
> This looks a little scary. Why is it ok to ignore the exception?
If we cannot send the request to the NC, which according to our discussion with 
Yingyi, we decided we should assume reliable networking, then this NC has 
failed but we haven't received the failure notification yet. Since this request 
has already been added to the pending requests, then when the failure 
notification arrives, these partitions will be directed to a different active 
replica (if we find any).


Line 382:             e.printStackTrace();
> Should we just continue to run if this fails?
If we cannot send the request to the NC, which according to our discussion with 
Yingyi, we decided we should assume reliable networking, then this NC has 
failed but we haven't received the failure notification yet. When the failure 
notification arrives, some other active replica will handle the metadata 
partition. After that, another metadata node takeover request will be sent to 
this newly assigned replica.


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-replication/src/main/java/org/apache/asterix/replication/management/ReplicationChannel.java
File 
asterix-replication/src/main/java/org/apache/asterix/replication/management/ReplicationChannel.java:

Line 341:                     String compoentId = 
LSMComponentProperties.getLSMComponentID(afp.getFilePath());
> s/compoentId/componentId/
Done


Line 381:             Map<String, ClusterPartition[]> nodePartitions = 
((IAsterixPropertiesProvider) asterixAppRuntimeContextProvider
> line break?
our formatter allows this length :)


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-replication/src/main/java/org/apache/asterix/replication/management/ReplicationManager.java
File 
asterix-replication/src/main/java/org/apache/asterix/replication/management/ReplicationManager.java:

Line 123:     private String hostIPAddressFirstOctet = null;
> This doesn't seem to be a good idea. Why would multiple network interfaces 
UCI wireless networks have different first byte than wired networks. I 
encountered an issue when I was doing demos and both wired and wireless 
interfaces were active. In a real setup however, this shouldn't be the case. 
This is used just in case the IP address of a node has changed since it failed 
and reconnected. The new IP address is broadcasted to other replicas so that 
they know where to reconnect this the node that just came back. This IP address 
configuration should be centralized on the CC or something like ZooKeeper and 
other replicas should fetch it from there.


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-replication/src/main/java/org/apache/asterix/replication/recovery/RemoteRecoveryManager.java
File 
asterix-replication/src/main/java/org/apache/asterix/replication/recovery/RemoteRecoveryManager.java:

Line 62:         //Currently we will not allow a node to perform remote 
recovery since another replica already tookover its workload
> line break?
Done


Line 73:             //start recovery recovery steps
> Just 1 recovery?
Done


Line 88:                     throw new IllegalStateException("no ACTIVE remote 
replica(s) exists to performe remote recovery");
> s/performe/perform/
Done


Line 103:                 maxRemoteLSN = 
replicationManager.getMaxRemoteLSN(selectedRemoteReplicas.keySet());
> Why is maxRemoteLSN initialized separately?
Removed.


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-replication/src/main/java/org/apache/asterix/replication/storage/LSMComponentProperties.java
File 
asterix-replication/src/main/java/org/apache/asterix/replication/storage/LSMComponentProperties.java:

Line 142:     public void setComponentId(String componentId) {
> unused?
Removed.


Line 150:     public void setOriginalLSN(long lSN) {
> unused?
Removed.


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-replication/src/main/java/org/apache/asterix/replication/storage/LSMIndexFileProperties.java
File 
asterix-replication/src/main/java/org/apache/asterix/replication/storage/LSMIndexFileProperties.java:

Line 46:     public LSMIndexFileProperties(String filePath, long fileSize, 
String nodeId, boolean lsmComponentFile,
> unused
Now it is used in create()


Line 96:         LSMIndexFileProperties fileProp = new LSMIndexFileProperties();
> Just use the constructor with arguments?
Done.


Line 105:     public void setFilePath(String filePath) {
> unused?
Removed.


Line 117:     public void setFileName(String fileName) {
> unused?
Removed.


Line 137:     public void setFileSize(long fileSize) {
> unused?
Removed.


Line 145:     public void setIdxName(String idxName) {
> unused?
Removed.


Line 153:     public void setLsmComponentFile(boolean lsmComponentFile) {
> unused?
Removed.


Line 161:     public void setRequiresAck(boolean requiresAck) {
> unused?
Removed.


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-replication/src/main/java/org/apache/asterix/replication/storage/ReplicaResourcesManager.java
File 
asterix-replication/src/main/java/org/apache/asterix/replication/storage/ReplicaResourcesManager.java:

Line 78:                 }
> Any reason why we don't use FileUtils.deleteQuietly here as well?
Used.


Line 126:         }
> Could we just use Files.deleteIfExists here?
Done


Line 337:             }
> "return name.endsWith(LSM_COMPONENT_MASK_SUFFIX);" ?
Done.


Line 347:             }
> "return !name.endsWith(LSM_COMPONENT_MASK_SUFFIX);"?
Done


Line 353:             if 
(name.equalsIgnoreCase(PersistentLocalResourceRepository.METADATA_FILE_NAME)) {
> "return name.equalsIgnoreCase(PersistentLocalResourceRepository.METADATA_FI
Done


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java
File 
asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java:

Line 166:                 ObjectOutputStream oosToFos = new 
ObjectOutputStream(fos)) {
> Nice, AutoCloseables :)
:)


https://asterix-gerrit.ics.uci.edu/#/c/573/3/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/RecoveryManager.java
File 
asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/RecoveryManager.java:

Line 1119:         
//-------------------------------------------------------------------------
> I think it would be nice to split this into 2 methods - one for the analysi
I will be modifying the recovery methods in the next patch. I will see how 
feasible it is.


Line 1155:                                     //if we don't have enough memory 
for one more job, we will force all jobs to spill their cached entities to disk.
> Line breaks?
Done


Line 1268:                                     
datasetLifecycleManager.open(resourceAbsolutePath);
> It seems that these don't get closed if something throws ...
Nice catch :)
I moved the closing part to the finally block and added a try-catch between 
getComponentLSN to close the recently opened index in case of exception.


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/573
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ice26d980912a315fcb3efdd571d6ce88717cfea4
Gerrit-PatchSet: 3
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Murtadha Hubail <hubail...@gmail.com>
Gerrit-Reviewer: Ian Maxon <ima...@apache.org>
Gerrit-Reviewer: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Gerrit-Reviewer: Murtadha Hubail <hubail...@gmail.com>
Gerrit-Reviewer: Till Westmann <ti...@apache.org>
Gerrit-Reviewer: abdullah alamoudi <bamou...@gmail.com>
Gerrit-HasComments: Yes

Reply via email to