On 11 August 2016 at 11:10, Chetan Mehrotra <[email protected]> wrote:
> On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <[email protected]> wrote:
> > Both Solr Cloud and ES address this by sharding and
> > replicating the indexes, so that all commits are soft, instant and
> > real time. That introduces problems.
> ...
> > Both Solr Cloud and ES address this by sharding and
> > replicating the indexes, so that all commits are soft, instant and
> > real time.
>
> This would really be useful. However, I have a couple of aspects to
> clear up.
>
> Index Update Guarantee
> --------------------------------
>
> Let's say a commit succeeds and then the index update fails for some
> reason. Would that update be missed, or can there be some mechanism to
> recover? I am not very sure about the WAL here; it may be the answer,
> but I am still confirming.

For ES (I don't know how the Solr Cloud WAL behaves): an update is not
acknowledged until it has been written to the WAL, so if something fails
before that, recovery is up to how the queue of updates is managed, which
is client side. Once it is written to the WAL it will eventually be
indexed, whatever happens, provided the WAL is available. Think of the
WAL as equivalent to the Oak journal, IIUC. The WAL is present on all
replicas, so provided one replica of a shard is available, no data is
lost.

> In Oak, with the way the async index update works based on checkpoints,
> it is ensured that the index would "eventually" contain the right data
> and no update would be lost. If there is a failure in an index update,
> that cycle fails and the next cycle starts again from the same base
> state.

Sounds like the same level of guarantee, depending on how the client side
is implemented. Typically I didn't bother with a queue between the
application and the ES client because the ES client was so fast.
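To make the WAL guarantee concrete, here is a minimal sketch (this is
not the real ES translog; the class, file format, and method names are
all invented for illustration): the operation is made durable first,
only then applied to the index, so a crash between the two steps loses
nothing because recovery replays the log.

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: an update counts as accepted only once it
    has been appended and fsync'd to the log file. The in-memory
    'index' may lag behind; replay() rebuilds it after a crash."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # stands in for the search index

    def update(self, doc_id, fields):
        # 1. Durably record the operation first...
        with open(self.path, "a") as f:
            f.write(json.dumps({"id": doc_id, "fields": fields}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. ...only then apply it to the (possibly slower) index.
        self.index.setdefault(doc_id, {}).update(fields)

    def replay(self):
        # After a crash, rebuild the index state from the log.
        self.index = {}
        with open(self.path) as f:
            for line in f:
                op = json.loads(line)
                self.index.setdefault(op["id"], {}).update(op["fields"])

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(path)
wal.update("/a", {"x": 1})
wal.update("/a", {"y": 2})

# Simulate a crash: discard the in-memory index, then recover from the log.
crashed = TinyWAL(path)
crashed.replay()
print(crashed.index["/a"])  # {'x': 1, 'y': 2}
```

The ordering of the two steps is the whole point: swap them and an
update can be acknowledged yet lost.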
> Order of index update
> -----------------------------
>
> Let's say I have 2 cluster nodes where the same node is being updated.
>
> Original state /a {x:1}
>
> Cluster Node N1 - /a {x:1, y:2}
> Cluster Node N2 - /a {x:1, z:3}
>
> End State /a {x:1, y:2, z:3}
>
> At the Oak level both commits would succeed, as there is no conflict.
> However, N1 and N2 would not see each other's updates immediately;
> that depends on the background read. So in this case what would the
> index update look like?
>
> 1. Would index updates for specific paths go to some master which
> would order the updates?

Correct. Documents are sharded by ID, so all updates to the same document
hit the same shard. That may result in network traffic if the shard is
not local.

> 2. Or would it end up with either of {x:1, y:2} or {x:1, z:3}?
>
> Here the current async index update logic ensures that it sees the
> eventually expected order of changes and hence would be consistent
> with the repository state.
>
> Backup and Restore
> ---------------------------
>
> Would a backup now involve backing up the ES index files from each
> cluster node? Or, assuming full replication, would it involve backing
> up the files from any one of the nodes? Would the backup be in sync
> with the last changes done in the repository (assuming a sudden
> shutdown where changes got committed to the repository but not yet to
> any index)?
>
> Here the current approach of storing index files as part of MVCC
> storage ensures that the index state is consistent with some
> "checkpointed" state in the repository. After a restart it would
> eventually catch up with the current repository state and hence would
> not require a complete rebuild of the index in case of an unclean
> shutdown.

If the revision is present in the document, then I assume it can be
filtered at query time. However, there may be problems here, as we might
have to find some way of indexing the revision history of a document...
like the format in MongoDB...
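The "sharded by ID" routing mentioned above can be sketched like this.
ES actually hashes the routing key (by default the document _id) with
murmur3; the hash function and shard count below are stand-ins chosen
only for illustration. The property that matters is that one document ID
always maps to one shard, so that shard can apply competing updates from
N1 and N2 in a single order.

```python
import hashlib

NUM_SHARDS = 5  # assumed shard count for the sketch

def shard_for(doc_id, num_shards=NUM_SHARDS):
    # Any stable hash gives the key property: the same document ID
    # always routes to the same shard. (ES itself uses murmur3.)
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

# Updates to /a, whether they come from N1 or N2, route to the same
# shard, which can therefore order them:
assert shard_for("/a") == shard_for("/a")

# Different documents may land on different shards, which is where the
# non-local network traffic comes from:
print({p: shard_for(p) for p in ["/a", "/b", "/c"]})
```

If the shard that owns /a is not on the node that made the change, the
update crosses the network to reach it, which is the cost mentioned
above.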
I did wonder if a better solution would be to use ES as the primary
storage; then all the property indexes would be present by default, with
no need for any Lucene index plugin... but I stopped thinking about that
with the 1s root document sync, as my interest was real time.

Best Regards
Ian

> Chetan Mehrotra
