[
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radhika Kundam updated ATLAS-4408:
----------------------------------
Description:
*Index failure resilience:* dynamic handling of failures in updating the index
(i.e. HBase commit succeeds but index commit fails).
When secondary persistence fails, the indexes become inconsistent for all the
transactions that failed at Solr. The existing option for repairing this is to
re-index all the data, which is time-consuming because it involves indexing the
entire database.
To recover from such inconsistencies we can use the *transaction write-ahead log
option*. When the write-ahead log (tx.log-tx) is enabled, JanusGraph maintains
all the transaction log data, which can be used to recover indices after
failures. Maintaining the log data for all transactions adds overhead, but this
approach makes the system more resilient and proactive, and these advantages
can offset that overhead.
Design details are as follows.
# Start a new service, IndexRecoveryService, at Atlas startup.
## Continuously monitor Solr (index client) health every retryTime
milliseconds.
### If Solr is healthy and a recovery start time is available:
#### Start transaction recovery from the available recovery start time (which
was noted when Solr became unhealthy).
#### Persist the current recovery start time as the previous one, so that it
can be passed later as a custom recovery time to start index recovery if
required.
#### Reset the current recovery start time.
#### Continue with the Solr health check.
### If Solr is unhealthy and no recovery start time is available:
#### Shut down the existing transaction recovery process.
#### Note the current time, which will be the next recovery start time, and
persist it in the graph.
#### Continue with the Solr health check.
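The monitoring steps above can be sketched as a small state machine. This is only an illustration, not the IndexRecoveryService implementation: the class name IndexRecoveryMonitor, the checkOnce method, and the recorded action strings are hypothetical. The Solr health check is replaced by a BooleanSupplier, and the recovery times are held in memory here rather than persisted in the graph.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of one pass of the index recovery monitor. */
class IndexRecoveryMonitor {
    // Stand-in for the Solr (index client) health check.
    private final BooleanSupplier solrHealthy;

    // Records what a real service would do; used here only for illustration.
    final List<String> actions = new ArrayList<>();

    // Noted when Solr became unhealthy; null when no recovery is pending.
    Instant recoveryStartTime;

    // Last start time used, kept so it can be supplied later as a custom start time.
    Instant previousRecoveryStartTime;

    IndexRecoveryMonitor(BooleanSupplier solrHealthy) {
        this.solrHealthy = solrHealthy;
    }

    /** One health check; a real service would repeat this every retryTime milliseconds. */
    void checkOnce(Instant now) {
        boolean healthy = solrHealthy.getAsBoolean();
        if (healthy && recoveryStartTime != null) {
            // Solr is back: recover from the time the outage was first seen.
            actions.add("startRecovery from " + recoveryStartTime);
            previousRecoveryStartTime = recoveryStartTime;
            recoveryStartTime = null;
        } else if (!healthy && recoveryStartTime == null) {
            // Solr just went down: stop any running recovery and note the time.
            actions.add("stopRecovery");
            recoveryStartTime = now;
        }
        // In every other case there is nothing to do until the next check.
    }
}
```

Calling checkOnce repeatedly while the supplier reports an outage records the start time only once, so a later recovery covers the whole outage rather than just its last interval.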
Configuration properties used by this feature:
1. To enable or disable index recovery (index recovery is enabled by default
on Atlas startup):
*atlas.graph.enable.index.recovery=true*
2. To configure how frequently the Solr health check is done:
*atlas.graph.index.search.solr.status.retry.interval=<time in ms>*
3. To start index recovery from a user-provided custom recovery time:
*atlas.graph.index.search.solr.recovery.start.time=1630086622*
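As a rough illustration of how these three properties might be read, the sketch below parses them with java.util.Properties. The class IndexRecoveryConfig is hypothetical and is not Atlas code. Two things are assumptions rather than facts from this issue: the 30000 ms fallback for the retry interval, and the reading of the example start time 1630086622 as epoch seconds.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical holder for the index recovery settings listed above. */
class IndexRecoveryConfig {
    final boolean recoveryEnabled;
    final long retryIntervalMs;
    final Optional<Instant> customStartTime;

    IndexRecoveryConfig(Properties p) {
        // Index recovery is enabled by default, as stated in the description.
        recoveryEnabled = Boolean.parseBoolean(
                p.getProperty("atlas.graph.enable.index.recovery", "true").trim());

        // Assumption: 30000 ms is an illustrative fallback, not a documented default.
        retryIntervalMs = Long.parseLong(
                p.getProperty("atlas.graph.index.search.solr.status.retry.interval", "30000").trim());

        // Assumption: the value is interpreted as epoch seconds.
        String start = p.getProperty("atlas.graph.index.search.solr.recovery.start.time");
        customStartTime = (start == null)
                ? Optional.empty()
                : Optional.of(Instant.ofEpochSecond(Long.parseLong(start.trim())));
    }

    /** Parses properties-file text; wraps the (unlikely) IOException from StringReader. */
    static IndexRecoveryConfig parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new IndexRecoveryConfig(p);
    }
}
```

Keeping the custom start time optional mirrors the description: it is only needed when the user wants recovery to begin from a specific time.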
was:
*Index failure resilience:* dynamic handling of failure in updating index (i.e.
HBase commit succeeds but index commit fails).
To support this feature, we need to enable the *tx.log-tx* property, which will
start storing write-ahead logs. *With this approach we need to maintain more
data related to write-ahead transaction logs*. But comparing the advantages of
proactive index recovery with re-indexing the entire data in case of secondary
persistence failures, this feature is worth having despite the overhead of
maintaining more data.
Design details as below.
# Start new service - IndexRecoveryService at Atlas startup.
## Continuously monitor for Solr(Index Client) health for every retryTime
millisecs
### If Solr is healthy and recovery start time is available,
#### Start Transaction Recovery with available recovery start time(which is
noted when Solr became unhealthy)
#### Persist current recovery time as previous which can be used later by
passing as custom recovery time to start index recovery if required.
#### Reset current recovery start time
#### Continue with Solr health checkup.
### If Solr is unhealthy and no recovery start time is available,
#### Shutdown the existing transaction recovery process.
#### Note down the time which should be the next recovery start time and
persist in graph.
#### Continue with Solr health checkup.
Configuration properties to be used for this feature.
1.To enable or disable index recovery(By default index recovery will be enabled
on Atlas startup)
*atlas.graph.enable.index.recovery=true*
2.To configure how frequently SOLR health check should be done
*atlas.graph.index.search.solr.status.retry.interval=<time in ms>*
3.To start index recovery by custom recovery time as user provided
*atlas.graph.index.search.solr.recovery.start.time=1630086622*
> Dynamic handling of failure in updating index
> ---------------------------------------------
>
> Key: ATLAS-4408
> URL: https://issues.apache.org/jira/browse/ATLAS-4408
> Project: Atlas
> Issue Type: New Feature
> Components: atlas-core
> Reporter: Radhika Kundam
> Assignee: Radhika Kundam
> Priority: Major