Hello Mike

My ES version is 1.2.1
I checked the utilization of the nodes in my cluster. Common values for all nodes are:
java process CPU utilization: < 6%
os load: < 1
io stat: < 15kB/s write

I tested the indexing process with 2 methods:
a) indexing native JSON data (13GB split into 100MB chunks)
time for i in /tmp/SMT* ; do echo $i; curl -s -XPOST h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk --data-binary @$i ; rm -f $i; done

b) indexing CSV data using a Perl script

my $e = Search::Elasticsearch->new(
               nodes => [
                   'h3:9200',
               ]   
           );  


my $bulk = $e->bulk_helper(
    index => $idx_name,
    type  => $idx_type,
    max_count => 10000
);

open(my $DATA, '<', $data_file) or die $!; 
while(<$DATA>) {
    chomp;

    my @data = split(',', $_);
    $bulk->index({ source => {  
                                p0  => $data[0], 
                                p1  => $data[1],
                                p2  => $data[2],
                                p3  => $data[3],
                                p4  => $data[4],
                                p5  => $data[5],
                                p6  => $data[6],
                                p7  => $data[7],
                                p8  => $data[8],
                                p9  => $data[9],
                                p10 => $data[10],
                                p11 => $data[11]
                }});

}
close($DATA);
$bulk->flush;
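
The loop in method a) sends one chunk at a time from a single client. For comparison, a parallel variant could be sketched as below. This is only a sketch under assumptions: `bulk_parallel` and the `CURL` override are hypothetical names introduced here, it relies on `xargs -0 -P` (available in GNU and BSD xargs, not in plain POSIX), and unlike the original loop it does not delete the chunks after sending.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original test): POST each chunk file
# to a _bulk URL, running up to $jobs uploads at the same time.
# usage: bulk_parallel URL JOBS FILE...
bulk_parallel() {
    url=$1
    jobs=$2
    shift 2
    # Nothing to send if no chunk files were given.
    [ "$#" -gt 0 ] || return 0
    # NUL-separate the file names so xargs handles unusual characters,
    # then start one upload per file, at most $jobs in parallel.
    # CURL can be overridden (e.g. CURL=echo) for a dry run.
    printf '%s\0' "$@" |
        xargs -0 -P "$jobs" -I{} ${CURL:-curl} -s -XPOST "$url" --data-binary @{}
}
```

For example, `bulk_parallel h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk 4 /tmp/SMT*` would keep four uploads in flight; the value 4 is arbitrary and would need tuning against the cluster.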

Setting refresh_interval to 600s has no effect in either case: the data is 
available immediately. According to the ES documentation, I expected new data 
to become available only after 10 minutes and, as a consequence, indexing 
to be quicker, but it is not.
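
For reference, the setting can be applied through the index settings API roughly as follows. This is a minimal sketch, not necessarily the exact commands used in the tests: `refresh_settings_body`, `set_refresh_interval` and the `CURL` override are hypothetical names introduced here, and the host and index are supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: print the JSON body for a refresh_interval update.
refresh_settings_body() {
    printf '{"index":{"refresh_interval":"%s"}}\n' "$1"
}

# Hypothetical helper: PUT the new interval to the index _settings endpoint.
# usage: set_refresh_interval HOST INDEX INTERVAL
# CURL can be overridden (e.g. CURL=echo) to print the call instead of sending it.
set_refresh_interval() {
    host=$1
    index=$2
    interval=$3
    ${CURL:-curl} -s -XPUT "$host/$index/_settings" \
        -d "$(refresh_settings_body "$interval")"
}
```

Calling `set_refresh_interval h3:9200 smt_20140501_bulk_json_refresh_600 600s` and then reading back `h3:9200/smt_20140501_bulk_json_refresh_600/_settings?pretty` with curl should show the stored value.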

Regards

On Wednesday, 16 July 2014 16:52:31 UTC+2, Michael McCandless wrote:
>
> Which ES version are you using?  You should use the latest (soon to be 
> 1.3): there have been a number of bulk-indexing improvements recently.
>
> Are you using the bulk API with multiple/async client threads?  Are you 
> saturating either CPU or IO in your cluster (so that the test is really a 
> full cluster capacity test)?
>
> Also, the relationship between refresh_interval and indexing performance 
> is tricky: it turns out, -1 is often a poor choice, because it means your 
> bulk indexing threads are sometimes tied up flushing segments, whereas with 
> refreshing enabled, it's a separate thread that does that.  So a refresh of 
> 5s is maybe a good choice.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Jul 16, 2014 at 6:51 AM, Marek Dabrowski <[email protected]> wrote:
>
>> Hello
>>
>> My configuration is:
>> 6 nodes Elasticsearch cluster
>> OS: Centos 6.5
>> JVM: 1.7.0_25
>>
>> The cluster is working fine. I can index data, query, etc. Now I'm running a 
>> test on a batch of ~50 million docs (~13GB). I would like to get better 
>> performance while indexing data. To achieve this, I changed the 
>> refresh_interval parameter. I tested 1s, -1 and 600s. The indexing time 
>> is the same. I checked the index configuration (_settings) and the 
>> refresh_interval value is correct, e.g.:
>>
>> {
>>   "smt_20140501_100000_20g_norefresh" : {
>>     "settings" : {
>>       "index" : {
>>         "uuid" : "q3imiZGQTDasQUuMWS8oiw",
>>         "number_of_replicas" : "1",
>>         "number_of_shards" : "6",
>>         "refresh_interval" : "600s",
>>         "version" : {
>>           "created" : "1020199"
>>         }
>>       }
>>     }
>>   }
>> }
>>
>>
>>
>> Index creation, setting refresh_interval and loading are all done on the same 
>> cluster node. Before each test the index is deleted and created again with 
>> the new refresh_interval value. The logs on all cluster nodes show 
>> that the parameter has been changed, e.g.:
>> [2014-07-16 11:24:09,813][INFO ][index.shard.service      ] [h6] [smt_20140501_100000_20g_norefresh][1] updating refresh_interval from [1s] to [-1]
>> or
>> [2014-07-16 11:32:32,928][INFO ][index.shard.service      ] [h6] [smt_20140501_100000_20g_norefresh][1] updating refresh_interval from [1s] to [10m]
>>
>> After the test starts, new data is available immediately and the indexing 
>> time is the same in all 3 cases. I don't know where the problem is. Does 
>> anybody know what is going on?
>>
>> Regards
>> Marek
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/f7565c36-98c7-4e3e-8132-796f9edfb3fa%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
