Hello Nicolas,

Why don't you handle this during the update itself? The update can be done
with a script that only appends the tag when it is missing. So something like:

# skip the update (ctx.op = "none") when the tag is already present,
# otherwise append it
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
    "script" : "ctx._source.tags.contains(tag) ? ctx.op = \"none\" : ctx._source.tags += tag",
    "params" : {
        "tag" : "blue"
    }
}'
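
If the document may not exist yet, you can combine the script with an
"upsert" document so that the first tag creates it. A rough sketch (the
index, type, id, and tag value here are just placeholders, and I'm assuming
your usertags field):

curl -XPOST 'localhost:9200/bookmarks/bookmark/1/_update' -d '{
    "script" : "ctx._source.usertags.contains(tag) ? ctx.op = \"none\" : ctx._source.usertags += tag",
    "params" : {
        "tag" : "tag1"
    },
    "upsert" : {
        "usertags" : ["tag1"]
    }
}'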

Link -
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html#docs-update
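
Since you are indexing in batches, the same scripted update can also be sent
as an action in a _bulk request, one action/payload line pair per document.
A sketch along the lines of the docs example, untested:

curl -XPOST 'localhost:9200/_bulk' -d '
{ "update" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "script" : "ctx._source.tags.contains(tag) ? ctx.op = \"none\" : ctx._source.tags += tag", "params" : { "tag" : "blue" } }
'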

Thanks,
Vineeth


On Wed, Jul 9, 2014 at 1:42 PM, Nicolas Giraud <[email protected]> wrote:

> Hi,
>
> I am very often confronted with the following pattern when doing batch
> indexing in update mode. Let me illustrate it with a short example:
>
> Let's pretend I am indexing social bookmarking data: a very crude document
> would be something like:
>
> {
>     "url": "http://somecoolwebsite.org",
>     ...
>     "usertags": [ "tag1", "tag2", ..., "tagN" ]
>     ...
> }
>
> My indexer processes a very large list of user bookmarks and batch
> updates/upserts the documents in Elasticsearch. My problem is that if I
> simply use concatenation in the update script, I may end up with lots of
> duplicate values in my *usertags* array, as many users have potentially
> applied the same tag over and over to a given URL. Instead, I would like
> set semantics on the array values.
>
> Currently I have this pattern in a bunch of use cases, and I generally
> handle it within the batch program by deduplicating values, using a
> BerkeleyDB to keep as much data as I can in memory. However, the
> performance cost becomes prohibitive when I have to perform set logic over
> millions of input records. Below 5M records the cost stays acceptable, but
> past 5M the insertion time in my BDB becomes unacceptable.
>
> Another way would be not to deduplicate at all and to use a terms facet at
> query time to obtain deduplicated values, but the index size would
> potentially grow significantly.
>
> Lastly, one could put in place a post-processing batch to deduplicate the
> values, but that amounts to reindexing everything. With batch processing
> and parallel execution this could probably scale pretty well, but it would
> be time-consuming.
>
> This is probably a very common pattern, so I'd very much appreciate some
> pointers on how other Elasticsearch users have dealt with it.
>
> Best regards,
>
> Nicolas
>
