Hi, 

I am very often confronted with the following pattern when doing batch 
indexing in update mode. Let me illustrate it with a short example: 

Let's pretend I am indexing social bookmarking data: a very crude document 
would be something like: 

    {
        "url": "http://somecoolwebsite.org",
        ...
        "usertags": [ "tag1", "tag2", ..., "tagN" ]
        ...
    }

My indexer processes a very large list of user bookmarks and batch 
updates/upserts the documents in Elasticsearch. My problem is that if I 
simply use concatenation in the update script, I may end up with lots of 
duplicate values in my usertags array, as many users have potentially used 
the same tag over and over again on a given url. Instead, I would like to 
have set semantics on the array values. 
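
To make the difference concrete, here is a minimal Python sketch of the two behaviours, applied to a plain dict standing in for the stored document. The function names (`append_tags`, `merge_tags_as_set`) are purely illustrative and are not part of any Elasticsearch API; the second one is the behaviour I would like the update script to have.

```python
def append_tags(doc, new_tags):
    """Plain concatenation: roughly what a naive update script does."""
    doc.setdefault("usertags", []).extend(new_tags)
    return doc


def merge_tags_as_set(doc, new_tags):
    """Set-style merge: add a tag only if it is not already present."""
    tags = doc.setdefault("usertags", [])
    seen = set(tags)
    for tag in new_tags:
        if tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return doc


doc_a = {"url": "http://somecoolwebsite.org", "usertags": ["tag1"]}
doc_b = {"url": "http://somecoolwebsite.org", "usertags": ["tag1"]}

append_tags(doc_a, ["tag1", "tag2", "tag1"])
merge_tags_as_set(doc_b, ["tag1", "tag2", "tag1"])

print(doc_a["usertags"])  # ['tag1', 'tag1', 'tag2', 'tag1']
print(doc_b["usertags"])  # ['tag1', 'tag2']
```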

Currently I have this pattern in a bunch of use cases, and I generally 
handle it within the batch program by deduplicating values, using a 
BerkeleyDB to keep as much data as I can in memory. However, the performance 
cost becomes prohibitive when I have to perform set logic over millions of 
input records. Below 5M records the cost is acceptable, but past 5M the 
insertion time in my BDB becomes unacceptable. 
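
For reference, the client-side deduplication boils down to something like the following sketch. It is simplified: an in-memory dict of sets stands in for the BerkeleyDB store, and the `(url, tag)` input format and the `group_tags_by_url` name are assumptions made for the example.

```python
from collections import defaultdict


def group_tags_by_url(bookmarks):
    """Collapse (url, tag) records into one set of tags per URL."""
    tags_by_url = defaultdict(set)
    for url, tag in bookmarks:
        tags_by_url[url].add(tag)
    return tags_by_url


bookmarks = [
    ("http://somecoolwebsite.org", "tag1"),
    ("http://somecoolwebsite.org", "tag2"),
    ("http://somecoolwebsite.org", "tag1"),  # same tag from another user
]

grouped = group_tags_by_url(bookmarks)

# One document per URL, ready to be sent as an upsert.
docs = [{"url": url, "usertags": sorted(tags)} for url, tags in grouped.items()]
print(docs)
```

The structure has to hold every distinct (url, tag) pair seen so far, which is why it no longer fits comfortably in memory once the input gets large.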

Another way would be not to deduplicate at all and to use a terms facet at 
query time to obtain deduplicated values, but the index size would 
potentially grow significantly. 
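
As a rough sketch of this second approach, the search body could be built as below. It assumes the facets API (the one that predates aggregations), that `url` is indexed as a non-analyzed field so a `term` query matches it exactly, and a hypothetical helper name `distinct_tags_request`; the facet name `tags` and the default `max_tags` value are arbitrary.

```python
import json


def distinct_tags_request(url, max_tags=100):
    """Build a search body asking for the distinct usertags of one URL."""
    return {
        "query": {"term": {"url": url}},
        "size": 0,  # only the facet is needed, not the hits themselves
        "facets": {
            "tags": {
                "terms": {"field": "usertags", "size": max_tags}
            }
        },
    }


body = distinct_tags_request("http://somecoolwebsite.org")
print(json.dumps(body, indent=2))
```

The idea is that each distinct tag should appear once in the facet's term list, however many times it was appended to the array.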

Lastly, one could put in place a post-processing batch to deduplicate the 
values, but that amounts to reindexing everything. With batch processing 
and parallel execution this could probably scale pretty well, but it would 
be time consuming. 

This is probably a very common pattern, so I would very much appreciate 
some pointers on how other Elasticsearch users have dealt with it. 

Best regards, 

Nicolas
