I see what you mean, but given the way my records are structured, that 
can't happen unless I reindex.

On Monday, December 8, 2014 at 12:05:13 PM UTC-8, Nikolas Everett wrote:
>
> I'm not sure what's going on, but remember that post_ids in the script is 
> a list, not a set. You might be growing it without bounds.
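>
> A minimal sketch of a deduplicating variant (an assumption on my part: it 
> uses Groovy's unique(), so Groovy scripting would need to be enabled on 
> your 1.3 cluster; string_id, string, and post_ids are the names from your 
> code below):
>
>   dedup_update = {
>     update: {
>       _index: 'post_strings',
>       _type:  'post_string',
>       _id:    string_id,
>       data: {
>         # unique() collapses duplicates, so re-running the indexer
>         # cannot grow the array without bounds
>         script: "ctx._source.post_ids = (ctx._source.post_ids + additional_post_ids).unique()",
>         lang:   "groovy",
>         params: { additional_post_ids: post_ids },
>         upsert: { value: string, post_ids: post_ids }
>       }
>     }
>   }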
> On Dec 8, 2014 2:49 PM, "Christophe Verbinnen" <[email protected]> wrote:
>
>> Hello,
>>
>> We have a small cluster with 3 nodes running 1.3.6.
>>
>> I have an index set up with only two fields.
>>
>>           {
>>             index: index_name,
>>             body: {
>>               settings: {
>>                 number_of_shards: 3,
>>                 store: {
>>                   type: :mmapfs
>>                 }
>>               },
>>               mappings: {
>>                 mapping_name => {
>>                   properties: {
>>                     value:    { type: 'string', analyzer: 'keyword' },
>>                     post_ids: { type: 'long', index: 'not_analyzed' }
>>                   }
>>                 }
>>               }
>>             }
>>           }
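>>
>> That hash is what we hand to the Ruby client's index-creation call; a 
>> sketch (index_definition is just a hypothetical name for the hash above, 
>> and $elasticsearch is the same client instance used in the indexing code 
>> below):
>>
>>   # create the index via the elasticsearch-ruby API; the hash above
>>   # supplies the index name, settings, and mappings
>>   $elasticsearch.indices.create(index_definition)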
>>
>>
>> We are basically storing strings and all the posts they are related to.
>>
>> The problem is that this data is not stored this way in the database, so 
>> I don't have an id to represent each string, nor do I have all the 
>> post_ids from the start.
>>
>> So I use the SHA1 of the string value as the id, and I use a script to 
>> append to post_ids.
>>
>> Here is the code I use to index via the bulk API endpoint.
>>
>> require 'digest'
>>
>> def index!
>>   post_ids = Post.where...   # the posts these strings relate to
>>   bulk_data = []
>>   strings.uniq.each do |string|
>>     # the SHA1 of the string doubles as a stable document id
>>     string_id = Digest::SHA1.hexdigest string
>>     bulk_data <<
>>       {
>>         update:
>>         {
>>           _index: 'post_strings',
>>           _type: 'post_string',
>>           _id: string_id,
>>           data: {
>>             # append the new ids, or create the document on first sight
>>             script: "ctx._source.post_ids += additional_post_ids",
>>             params: {
>>               additional_post_ids: post_ids
>>             },
>>             upsert: {
>>               value: string,
>>               post_ids: post_ids
>>             }
>>           }
>>         }
>>       }
>>     # flush every 100 actions to keep bulk requests small
>>     if bulk_data.count == 100
>>       $elasticsearch.bulk :body => bulk_data
>>       bulk_data = []
>>     end
>>   end
>>   # send any remaining actions
>>   $elasticsearch.bulk :body => bulk_data if bulk_data.any?
>> end
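>>
>> (I'm not checking the bulk responses here; a sketch of what that could 
>> look like, since the bulk API reports per-item failures in its response:)
>>
>>   response = $elasticsearch.bulk :body => bulk_data
>>   # 'errors' is true if any action in the batch failed
>>   if response['errors']
>>     failed = response['items'].select { |item| item['update']['error'] }
>>     # log or retry the failed updates here
>>   end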
>>
>> This worked fine for the first 75 million strings, but it kept getting 
>> slower until it reached an indexing rate of only 50 docs per second.
>>
>> After that the cluster just killed itself because the nodes couldn't 
>> talk to each other.
>>
>> I'm guessing all the threads were blocked trying to index, and the nodes 
>> had no available threads left to respond.
>>
>> At first I thought it was related to the SHA1 ids not being very 
>> efficient, but in my test with sequential ids it was no better.
>>
>> I'm out of ideas right now. Any help would be greatly appreciated.
>>
>> Cheers.
>>
>>
>

