[ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radhika Kundam updated ATLAS-4408:
----------------------------------
    Description: 
*Index failure resilience:* dynamic handling of failures when updating the index 
(i.e. the HBase commit succeeds but the index commit fails).

When the secondary persistence (the index) fails, the indexes become 
inconsistent for every transaction that failed at Solr. The only existing way 
to repair this is to re-index all the data, which is time-consuming because it 
involves indexing the entire database.

To recover from such inconsistencies we can use the *transaction write-ahead 
log option*. With the write-ahead log enabled (tx.log-tx), JanusGraph maintains 
log data for all transactions, which can be used to recover the indices after 
a failure. Maintaining log data for every transaction adds overhead, but it 
makes the system more resilient and lets it recover proactively instead of 
requiring a full re-index. That benefit outweighs the overhead of maintaining 
the log data.
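
As a rough illustration of the recovery idea described above, here is a toy Java sketch. It is not how JanusGraph implements its transaction log: the class WriteAheadLogSketch, its Entry type, and the methods commit and recoverIndexSince are all hypothetical, and two in-memory maps stand in for HBase and Solr. It only shows why logging each write before applying it makes it possible to replay the writes the index missed, starting from a known time, instead of re-indexing everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; every name here is hypothetical and not part of JanusGraph or Atlas. */
final class WriteAheadLogSketch {
    /** One logged write and the time it was committed. */
    static final class Entry {
        final long timestamp;
        final String key;
        final String value;

        Entry(long timestamp, String key, String value) {
            this.timestamp = timestamp;
            this.key = key;
            this.value = value;
        }
    }

    private final List<Entry> log = new ArrayList<>();
    final Map<String, String> primary = new HashMap<>(); // stands in for HBase
    final Map<String, String> index = new HashMap<>();   // stands in for Solr
    boolean indexAvailable = true;

    /** Logs the write first, then applies it to the primary store and, if possible, the index. */
    void commit(long timestamp, String key, String value) {
        log.add(new Entry(timestamp, key, value));
        primary.put(key, value);
        if (indexAvailable) {
            index.put(key, value);
        } // otherwise the index falls behind the primary store
    }

    /** Replays logged writes from the given time onward into the index; returns the replay count. */
    int recoverIndexSince(long fromTimestamp) {
        int replayed = 0;
        for (Entry e : log) {
            if (e.timestamp >= fromTimestamp) {
                index.put(e.key, e.value);
                replayed++;
            }
        }
        return replayed;
    }
}
```

Replaying only the entries logged since the index became unavailable is what keeps the repair cost proportional to the outage rather than to the size of the database.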

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time (which 
was noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the current time, which will be the next recovery start time, and 
persist it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (enabled by default on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check runs:
    *atlas.graph.index.search.solr.status.retry.interval=<time in ms>*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*
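
To show how these three properties might be read together, here is a minimal Java sketch using java.util.Properties. It is an illustration only, not Atlas code: the class IndexRecoveryConfigSketch and its fields are hypothetical, the 30000 ms fallback for the retry interval is an assumed placeholder (the description does not state a default), and the custom start time is kept as a raw number because the description does not state its unit.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical holder for the three properties listed above; not part of Atlas. */
final class IndexRecoveryConfigSketch {
    static final String ENABLED = "atlas.graph.enable.index.recovery";
    static final String RETRY_INTERVAL = "atlas.graph.index.search.solr.status.retry.interval";
    static final String START_TIME = "atlas.graph.index.search.solr.recovery.start.time";

    // Assumption: placeholder fallback, the description does not give a default interval.
    static final long ASSUMED_DEFAULT_RETRY_MS = 30_000L;

    final boolean enabled;
    final Duration retryInterval;
    final Optional<Long> customStartTime; // raw value; unit not stated in the description

    private IndexRecoveryConfigSketch(boolean enabled, Duration retryInterval,
                                      Optional<Long> customStartTime) {
        this.enabled = enabled;
        this.retryInterval = retryInterval;
        this.customStartTime = customStartTime;
    }

    static IndexRecoveryConfigSketch from(Properties props) {
        // Index recovery is enabled unless the property explicitly disables it.
        boolean enabled = Boolean.parseBoolean(props.getProperty(ENABLED, "true").trim());

        long retryMs = Long.parseLong(
                props.getProperty(RETRY_INTERVAL, String.valueOf(ASSUMED_DEFAULT_RETRY_MS)).trim());

        // The custom start time is optional; absent or blank means "not set".
        Optional<Long> startTime = Optional.ofNullable(props.getProperty(START_TIME))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Long::parseLong);

        return new IndexRecoveryConfigSketch(enabled, Duration.ofMillis(retryMs), startTime);
    }
}
```

Treating the start time as optional mirrors the design: it is only needed when an operator wants to trigger recovery from a specific point.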

 

  was:
*Index failure resilience:* dynamic handling of failure in updating index (i.e. 
HBase commit succeeds but index commit fails).

To support this feature, the *tx.log-tx* property needs to be enabled, which 
starts storing write-ahead logs. *With this approach we need to maintain more 
data related to write-ahead transaction logs*. But comparing the advantage of 
proactive index recovery with re-indexing the entire data in case of secondary 
persistence failures, the feature is worth having despite the overhead of 
maintaining more data.

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time (which 
was noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the current time, which will be the next recovery start time, and 
persist it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (enabled by default on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check runs:
    *atlas.graph.index.search.solr.status.retry.interval=<time in ms>*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*

 


> Dynamic handling of failure in updating index
> ---------------------------------------------
>
>                 Key: ATLAS-4408
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4408
>             Project: Atlas
>          Issue Type: New Feature
>          Components:  atlas-core
>            Reporter: Radhika Kundam
>            Assignee: Radhika Kundam
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
