I missed that the later doc would only be partial. What is the reason to use the partial doc? That really complicates things.
Filling in missing fields is going to be a very large headache. You'll probably kill performance trying to do it, too. It will likely be so complex that it causes a lot more trouble. I think if you can better present the overall use cases, you will get better insight into how to work this out.

On Thursday, May 1, 2014 4:51:03 PM UTC-7, Michał Zgliczyński wrote:
>
> Hi,
> Thank you for your response. I have looked through this blog post:
> http://www.elasticsearch.org/blog/elasticsearch-versioning-support/
> It looks as if external versioning would be the way to go: have the
> timestamps act as version numbers and let ES accept only the document with
> the newest version as the correct document. However, in the situation I
> presented above, ES will fail. A quote from the post:
> "With version_type set to external, Elasticsearch will store the version
> number as given and will not increment it. Also, instead of checking for an
> exact match, Elasticsearch will only return a version collision error if
> the version currently stored is greater or equal to the one in the indexing
> command. This effectively means “only store this information if no one else
> has supplied the same or a more recent version in the meantime”.
> Concretely, the above request will succeed if the stored version number is
> smaller than 526. 526 and above will cause the request to fail."
>
> In my example, we would have that situation. A partial doc with a larger
> version number (later timestamp) is already stored in ES, and we get the
> complete document with a smaller timestamp. In this situation we would like
> to merge these 2 documents so that we keep all of the fields from
> the partial doc, and the other fields (not currently specified in the ES
> document) are filled in from the complete document.
>
> Thanks!
> Michal Zgliczynski
>
> On Thursday, May 1, 2014 14:58:31 UTC-7, Rob Ottaway wrote:
>>
>> Have you looked at using versioning?
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning
>>
>> cheers,
>> Rob
>>
>> On Thursday, May 1, 2014 2:47:39 PM UTC-7, Michał Zgliczyński wrote:
>>>
>>> Hi,
>>>
>>> I am building a system in which I will have two sources of updates:
>>> 1) Bulk updating from the source of truth (db) <- always inserting
>>> documents (complete docs)
>>> 2) Live updates <- inserts and updates (complete and incomplete
>>> docs)
>>>
>>> Also, let's assume that each insert/update has a timestamp that we
>>> trust (not the ES timestamp).
>>>
>>> The idea is to have a complete, up-to-date index once the bulk updating
>>> finishes. To achieve this I need to guarantee that I will have the correct
>>> data. This would mostly work if everything we did were an upsert and
>>> the inserts/updates coming into ES had strictly increasing timestamps.
>>> But one can imagine a problematic situation:
>>>
>>> 1) We are performing bulk indexing:
>>> a) we read an object from the db
>>> b) process it
>>> c) send it to ES.
>>> 2) We have an update on the same object after step (a) and before it
>>> makes it to ES in the bulk updating phase (c). That is, ES gets an update
>>> with new data, and only after that do we get the insert with the entire
>>> document from the source of truth, with older data. Hence, ES holds a
>>> document with a newer timestamp than the one newly added in phase (c).
>>>
>>> My theoretical solution: for each operation, have the timestamp for that
>>> change (the timestamp from the system that made the change, not from
>>> Elasticsearch). Let's say that all of the operations that we will perform are
>>> upserts.
>>> Then once we get an insert or an update (let's call it doc), we have to
>>> perform the following script (pseudo mvel) inside ES.
>>> {
>>>   if (doc.timestamp > ctx.source.timestamp) {
>>>     // doc is newer than what was in ES
>>>     upsert(doc); // update the index with all of the info from the new doc
>>>   } else {
>>>     // there is already a document in ES with a newer timestamp, note,
>>>     // this may be an incomplete document (an update)
>>>     __fill the missing fields in the document in ES with values from doc__
>>>   }
>>> }
>>>
>>> My question is:
>>> 1) Is there a better approach?
>>> 2) If so, is there a simple approach for doing the '__fill the missing
>>> fields in the document in ES with values from doc__' operation/script?
>>>
>>> Thanks!
>>> Michal Zgliczynski
>>>
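
For reference, the two branches of the quoted pseudo-script can be sketched in plain Python, outside ES. This is only an illustration under stated assumptions: documents are plain dicts, every document carries a comparable `timestamp` field, and `merge_by_timestamp` is a hypothetical helper name, not an Elasticsearch API. It says nothing about how this would perform as a script inside ES.

```python
from typing import Any, Dict, Optional

Doc = Dict[str, Any]


def merge_by_timestamp(stored: Optional[Doc], incoming: Doc) -> Doc:
    """Return the document to keep after `incoming` arrives.

    Hypothetical helper: fields from the document with the newer
    timestamp win, and the older document only fills in fields
    that the newer one does not have.
    """
    if stored is None:
        # Nothing indexed yet, so the incoming document is stored as-is.
        return dict(incoming)

    if incoming["timestamp"] > stored["timestamp"]:
        newer, older = incoming, stored
    else:
        # Stored copy is newer (or equally new): keep its fields and
        # only fill the gaps from the incoming document.
        newer, older = stored, incoming

    merged = dict(older)   # start with the older values...
    merged.update(newer)   # ...then let the newer values overwrite them
    return merged


# Partial live update already stored, full bulk document arrives late.
stored = {"id": 1, "name": "new name", "timestamp": 20}
incoming = {"id": 1, "name": "old name", "city": "Krakow", "timestamp": 10}
result = merge_by_timestamp(stored, incoming)
print(result)
```

Note the limitation: with a single timestamp per document, a field filled in from the older document may itself be stale, and nothing in the merged result records that. Tracking that would need per-field timestamps, which is part of why this gets complicated quickly.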
