Hello

I found the cause of my problem.
When indexing from Perl, the index refresh depends on the "max_count" and 
"max_size" parameters passed to 
$e->bulk_helper
The values of these parameters determine when a refresh is done on the index.
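
For anyone who finds this thread later, here is a minimal sketch of a bulk_helper call that sets both limits explicitly. The node name comes from this thread; the index name, the numeric values and the sample document are hypothetical and only for illustration. As far as I understand the Search::Elasticsearch documentation, the helper buffers actions and sends a bulk request once either limit is reached, so this is a sketch under that assumption and not a tuned configuration.

```perl
use strict;
use warnings;
use Search::Elasticsearch;

# Node name taken from the thread; everything else below is hypothetical.
my $e = Search::Elasticsearch->new( nodes => ['h3:9200'] );

my $bulk = $e->bulk_helper(
    index     => 'smt_bulk_test',   # hypothetical index name
    type      => 'num',
    max_count => 10_000,            # send the buffer after this many actions
    max_size  => 10_000_000,        # ...or once it holds this many bytes
);

# Documents are only buffered here; a bulk request goes out when a limit is hit.
$bulk->index({ source => { p0 => 'a', p1 => 'b' } });

# Send whatever is still sitting in the buffer.
$bulk->flush;
```

If only max_count is raised, the byte limit may still be reached first, so it seems worth setting both values together when experimenting.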

Thanks for the help.

Regards


On Thursday, 17 July 2014 09:59:55 UTC+2, Marek Dabrowski 
wrote:
>
> Hello Mike
>
> My ES version is 1.2.1
> I checked the utilization of the nodes in my cluster. Common values for 
> all nodes are:
> java process cpu utilization: < 6%
> os load: < 1
> io stat: < 15kB/s write
>
> I tested the indexing process with 2 methods:
> a) indexing native JSON data (13GB split into 100MB chunks)
> time for i in /tmp/SMT* ; do echo "$i";
>   curl -s -XPOST h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk --data-binary @"$i";
>   rm -f "$i"; done
>
> b) indexing CSV data using a Perl script
>
> my $e = Search::Elasticsearch->new(
>                nodes => [
>                    'h3:9200',
>                ]   
>            );  
>
>
> my $bulk = $e->bulk_helper(
>     index => $idx_name,
>     type  => $idx_type,
>     max_count => 10000
> );
>
> open(my $DATA, '<', $data_file) or die $!; 
> while(<$DATA>) {
>     chomp;
>
>     my @data = split(',', $_);
>     $bulk->index({ source => {  
>                                 p0  => $data[0], 
>                                 p1  => $data[1],
>                                 p2  => $data[2],
>                                 p3  => $data[3],
>                                 p4  => $data[4],
>                                 p5  => $data[5],
>                                 p6  => $data[6],
>                                 p7  => $data[7],
>                                 p8  => $data[8],
>                                 p9  => $data[9],
>                                 p10 => $data[10],
>                                 p11 => $data[11]
>                 }});
>
> }
> close($DATA);
> $bulk->flush;
>
> Setting refresh_interval to 600s has no effect in either case. Data are 
> available immediately. Based on the ES documentation, I expected that new 
> data would become available after 10 minutes and that, as a consequence, 
> indexing would be quicker, but it isn't.
>
> Regards
>
> On Wednesday, 16 July 2014 16:52:31 UTC+2, Michael McCandless 
> wrote:
>>
>> Which ES version are you using?  You should use the latest (soon to be 
>> 1.3): there have been a number of bulk-indexing improvements recently.
>>
>> Are you using the bulk API with multiple/async client threads?  Are you 
>> saturating either CPU or IO in your cluster (so that the test is really a 
>> full cluster capacity test)?
>>
>> Also, the relationship between refresh_interval and indexing performance 
>> is tricky: it turns out, -1 is often a poor choice, because it means your 
>> bulk indexing threads are sometimes tied up flushing segments when with 
>> refreshing enabled, it's a separate thread that does that.  So a refresh of 
>> 5s is maybe a good choice.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Jul 16, 2014 at 6:51 AM, Marek Dabrowski <[email protected]> 
>> wrote:
>>
>>> Hello
>>>
>>> My configuration is:
>>> 6 nodes Elasticsearch cluster
>>> OS: Centos 6.5
>>> JVM: 1.7.0_25
>>>
>>> The cluster is working fine. I can index data, query, etc. Now I'm 
>>> running a test on a package of about ~50 million docs (~13GB). I would 
>>> like to get better performance while indexing data. To achieve this I 
>>> changed the refresh_interval parameter. I tested 1s, -1 and 600s. The 
>>> indexing time is the same. I checked the index configuration (_settings) 
>>> and the value of refresh_interval is ok (has the proper value), e.g.:
>>>
>>> {
>>>   "smt_20140501_100000_20g_norefresh" : {
>>>     "settings" : {
>>>       "index" : {
>>>         "uuid" : "q3imiZGQTDasQUuMWS8oiw",
>>>         "number_of_replicas" : "1",
>>>         "number_of_shards" : "6",
>>>         "refresh_interval" : "600s",
>>>         "version" : {
>>>           "created" : "1020199"
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>>
>>>
>>> Creating the index, setting refresh_interval and loading the data are all 
>>> done on the same cluster node. Before each test the index is deleted and 
>>> created again with the new value of refresh_interval. All cluster nodes 
>>> log that the parameter has been changed, e.g.:
>>> [2014-07-16 11:24:09,813][INFO ][index.shard.service      ] [h6] 
>>> [smt_20140501_100000_20g_norefresh][1] updating refresh_interval from [1s] 
>>> to [-1]
>>> or
>>> [2014-07-16 11:32:32,928][INFO ][index.shard.service      ] [h6] 
>>> [smt_20140501_100000_20g_norefresh][1] updating refresh_interval from [1s] 
>>> to [10m]
>>>
>>> After the test starts, new data are available immediately and the 
>>> indexing time is the same in all 3 cases. I don't know where the problem 
>>> is. Does anybody know what is going on?
>>>
>>> Regards
>>> Marek
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/f7565c36-98c7-4e3e-8132-796f9edfb3fa%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/elasticsearch/f7565c36-98c7-4e3e-8132-796f9edfb3fa%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
