[ https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Radhika Kundam updated ATLAS-4408:
----------------------------------
    Description:
*Index failure resilience:* dynamic handling of failure in updating index (i.e. HBase commit succeeds but index commit fails).

Design details as below.
# Start new service - IndexRecoveryService at Atlas startup.
## Continuously monitor for Solr (Index Client) health for every retryTime millisecs
### If Solr is healthy and recovery start time is available,
#### Start Transaction Recovery with available recovery start time (which is noted when Solr became unhealthy)
#### Persist current recovery time as previous which can be used later by passing as custom recovery time to start index recovery if required.
#### Reset current recovery start time
#### Continue with Solr health checkup.
### If Solr is unhealthy and no recovery start time is available,
#### Shutdown the existing transaction recovery process.
#### Note down the time which should be the next recovery start time and persist in graph.
#### Continue with Solr health checkup.

Configuration properties to be used for this feature.

1. To enable or disable index recovery (by default index recovery will be enabled on Atlas startup)
*atlas.graph.enable.index.recovery=true*
2. To configure how frequently Solr health check should be done
*atlas.graph.index.search.solr.status.retry.interval=<time in ms>*
3. To start index recovery by custom recovery time as user provided
*atlas.graph.index.search.solr.recovery.start.time=1630086622*

  was:
*Index failure resilience:* dynamic handling of failure in updating index (i.e. HBase commit succeeds but index commit fails
* monitor thread to check state of index
* save index state in graph node
* basic-search to use graph-queries instead of index-queries
* partial reindex of vertices i.e. vertices that were updated since last successful index update

> Dynamic handling of failure in updating index
> ---------------------------------------------
>
>                 Key: ATLAS-4408
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4408
>             Project: Atlas
>          Issue Type: New Feature
>          Components: atlas-core
>            Reporter: Radhika Kundam
>            Assignee: Radhika Kundam
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
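
The health-check branches in the design above can be read as a small state machine. The following is only a sketch of that logic under stated assumptions, not the actual Atlas implementation: the class `IndexRecoverySketch`, the `TxRecovery` interface, the `RecoveryState` holder and the injected health and clock suppliers are all hypothetical names introduced for illustration. In the design, the two recovery times would be persisted in the graph rather than held in memory, and `checkOnce()` would be called every retryTime milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the monitoring logic; not the actual Atlas class. */
class IndexRecoverySketch {
    /** In-memory stand-in for the recovery times the design persists in the graph. */
    static class RecoveryState {
        Long currentStartTime;   // noted when Solr is first seen unhealthy
        Long previousStartTime;  // last start time used, kept for a later custom re-run
    }

    /** Stand-in for the transaction recovery process. */
    interface TxRecovery {
        void start(long fromTime);
        void shutdown();
    }

    private final BooleanSupplier solrHealthy;
    private final LongSupplier clock;
    private final TxRecovery recovery;
    final RecoveryState state = new RecoveryState();

    IndexRecoverySketch(BooleanSupplier solrHealthy, LongSupplier clock, TxRecovery recovery) {
        this.solrHealthy = solrHealthy;
        this.clock = clock;
        this.recovery = recovery;
    }

    /** One health-check iteration covering the two branches of the design. */
    void checkOnce() {
        boolean healthy = solrHealthy.getAsBoolean();
        if (healthy && state.currentStartTime != null) {
            // Solr is back: recover from the time the outage was first noted.
            recovery.start(state.currentStartTime);
            state.previousStartTime = state.currentStartTime;
            state.currentStartTime = null;
        } else if (!healthy && state.currentStartTime == null) {
            // Solr just went down: stop recovery and note the next start time.
            recovery.shutdown();
            state.currentStartTime = clock.getAsLong();
        }
        // In every other case nothing changes and monitoring simply continues.
    }

    /** Simulates healthy -> down -> still down -> healthy and returns the recorded calls. */
    static List<String> demo() {
        boolean[] healthy = {true};
        long[] now = {1000L};
        List<String> log = new ArrayList<>();
        TxRecovery rec = new TxRecovery() {
            public void start(long fromTime) { log.add("start@" + fromTime); }
            public void shutdown() { log.add("shutdown"); }
        };
        IndexRecoverySketch svc = new IndexRecoverySketch(() -> healthy[0], () -> now[0], rec);
        svc.checkOnce();                      // healthy, nothing to recover
        healthy[0] = false; now[0] = 2000L;
        svc.checkOnce();                      // outage noted at 2000
        now[0] = 3000L;
        svc.checkOnce();                      // still down, start time is kept
        healthy[0] = true;
        svc.checkOnce();                      // recovery starts from 2000
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the start time is recorded only on the first unhealthy check, repeated failed checks during a long outage do not move it forward, so recovery replays everything since the outage began.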