Hi Mike,
new_ES_config.sh (defines the templates and disables refresh/flush):
curl -XPOST localhost:9200/doc -d '{
  "mappings" : {
    "type" : {
      "_source" : { "enabled" : false },
      "dynamic_templates" : [
        { "t1" : {
            "match" : "*_ss",
            "mapping" : {
              "type" : "string",
              "store" : false,
              "norms" : { "enabled" : false }
            }
        }},
        { "t2" : {
            "match" : "*_dt",
            "mapping" : {
              "type" : "date",
              "store" : false
            }
        }},
        { "t3" : {
            "match" : "*_i",
            "mapping" : {
              "type" : "integer",
              "store" : false
            }
        }}
      ]
    }
  }
}'
curl -XPUT localhost:9200/doc/_settings -d '{
  "index.refresh_interval" : "-1"
}'
curl -XPUT localhost:9200/doc/_settings -d '{
  "index.translog.disable_flush" : true
}'
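Once the ingestion run is over, these two settings would normally be restored so that searches see the new documents and the translog is flushed automatically again; a minimal sketch using the same 1.x setting names (the "1s" interval is an assumption, matching the 1.x default):

```shell
# Re-enable periodic refresh (interval value is an assumption)
curl -XPUT localhost:9200/doc/_settings -d '{
  "index.refresh_interval" : "1s"
}'
# Re-enable automatic translog flushing
curl -XPUT localhost:9200/doc/_settings -d '{
  "index.translog.disable_flush" : false
}'
```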
new_ES_ingest_threads.pl (spawns 10 threads that use curl to ingest the
docs, plus one thread to flush/optimize periodically):
my $num_args = $#ARGV + 1;
if ($num_args < 1 || $num_args > 2) {
print "\nusage: $0 [src_dir] [thread_count]\n";
exit;
}
my $INST_HOME="/scratch/aime/elasticsearch-1.2.1";
my $pid = qx(jps | sed -e '/Elasticsearch/p' -n | sed 's/ .*//');
chomp($pid);
if( "$pid" eq "")
{
print "Instance is not up\n";
exit;
}
my $dir = $ARGV[0];
my $td_count = 10;
$td_count = $ARGV[1] if($num_args == 2);
my $lf = "ingest_$dir.log";  # log file (name is an assumption)
open(FH, ">$lf") or die "cannot open $lf: $!";
print FH "source dir: $dir\nthread_count: $td_count\n";
print FH localtime()."\n";
use threads;
use threads::shared;
my $flush_intv = 10;
my $no:shared=0;
my $total = 10000;
my $intv = 1000;
my $tstr:shared = "";
my $ltime:shared = time;
sub commit {
$SIG{'KILL'} = sub {
    `curl -XPOST 'http://localhost:9200/doc/_flush'`;
    print "forced commit done on ".localtime()."\n";
    threads->exit();
};
while ($no < $total )
{
`curl -XPOST 'http://localhost:9200/doc/_flush'`;
`curl -XPOST 'http://localhost:9200/doc/_optimize'`;
print "commit on ".localtime()."\n";
sleep($flush_intv);
}
`curl -XPOST 'http://localhost:9200/doc/_flush'`;
print "commit done on ".localtime()."\n";
}
sub do {
my $c = -1;
while(1)
{
{
lock($no);
$c=$no;
$no++;
}
last if($c >= $total);
`curl -XPOST -s localhost:9200/doc/type/$c --data-binary \@$dir/$c.json`;
if( ($c +1) % $intv == 0 )
{
lock($ltime);
my $curtime = time;
$tstr .= ($curtime - $ltime)." ";
$ltime = $curtime;
}
}
}
# start the monitor processes
my $sarId = qx(sar -A 5 100000 -o sar5sec_$dir.out > /dev/null &\necho \$!);
my $jgcId = qx(jstat -gc $pid 2s > jmem_$dir.out &\necho \$!);
my $ct = threads->create(\&commit);
my $start = time;
my @ts=();
for my $i (1..$td_count)
{
my $t = threads->create(\&do);
push(@ts, $t);
}
for my $t (@ts)
{
$t->join();
}
$ct->kill('KILL');
my $fin = time;
qx(kill -9 $sarId\nkill -9 $jgcId);
print FH localtime()."\n";
$ct->join();
print FH qx(curl 'http://localhost:9200/doc/type/_count?q=*');
close(FH);
new_Solr_ingest_threads.pl is similar to new_ES_ingest_threads.pl but uses
different parameters for the curl commands. Only the differences are posted
here:
sub commit {
while ($no < $total )
{
`curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
`curl 'http://localhost:8983/solr/collection2/update?optimize=true'`;
print "commit on ".localtime()."\n";
sleep(10);
}
`curl 'http://localhost:8983/solr/collection2/update?commit=true'`;
print "commit done on ".localtime()."\n";
}
sub do {
my $c = -1;
while(1)
{
{
lock($no);
$c=$no;
$no++;
}
last if($c >= $total);
`curl -s 'http://localhost:8983/solr/collection2/update/json' --data-binary \@$dir/$c.json -H 'Content-type:application/json'`;
if( ($c +1) % $intv == 0 )
{
lock($ltime);
my $curtime = time;
$tstr .= ($curtime - $ltime)." ";
$ltime = $curtime;
}
}
}
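The ES script ends by querying `_count`; a hedged sketch of what the
corresponding final count check on the Solr side could look like (collection
name taken from the commands above; `rows=0` returns only `numFound`):

```shell
# Ask Solr for the total document count without returning any documents
curl 'http://localhost:8983/solr/collection2/select?q=*:*&rows=0&wt=json'
```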
B&R
Maco
On Wednesday, June 18, 2014 4:44:35 AM UTC+8, Michael McCandless wrote:
>
> Hi,
>
> Could you post the scripts you linked to (new_ES_config.sh,
> new_ES_ingest_threads.pl, new_Solr_ingest_threads.pl) inlined? I can't
> download them from where you linked.
>
> Optimizing every 10 seconds or 10 minutes is really not a good idea in
> general, but I guess if you're doing the same with ES and Solr then the
> comparison is at least "fair".
>
> It's odd you see such a slowdown with ES...
>
> Mike
>
> On Fri, Jun 13, 2014 at 2:40 PM, Cindy Hsin <[email protected]> wrote:
>
>> Hi, Mark:
>>
>> We are doing single document ingestion. We did a performance comparison
>> between Solr and Elasticsearch (ES).
>> ES performance degrades dramatically as we increase the number of metadata
>> fields, while Solr performance remains the same.
>> The benchmark uses a very small data set (i.e. 10k documents; the index
>> size is only 75mb). The machine is a high-spec machine with 48GB of memory.
>> You can see ES performance drop 50% even when the machine has plenty of
>> memory. ES consumes all the machine's memory when the metadata fields
>> increase to 100k.
>> This behavior seems abnormal since the data is really tiny.
>>
>> We also tried larger data sets (i.e. 100k and 1 Mil documents); ES threw
>> OOM for scenario 2 with the 1 Mil doc set.
>> We want to know whether this is a bug in ES and/or whether there is any
>> workaround (config step) we can use to eliminate the performance
>> degradation.
>> Currently ES performance does not meet the customer requirement, so we want
>> to see if there is any way we can bring ES performance to the same level
>> as Solr.
>>
>> Below are the configuration settings and benchmark results for the 10k
>> document set.
>> scenario 0 means there are 1000 different metadata fields in the system.
>> scenario 1 means there are 10k different metadata fields in the system.
>> scenario 2 means there are 100k different metadata fields in the system.
>> scenario 3 means there are 1M different metadata fields in the system.
>>
>> - disable hard commit & soft commit + use a *client* to do the commit (ES
>> & Solr) every 10 seconds
>> - ES: flush, refresh are disabled
>> - Solr: autoSoftCommit are disabled
>> - monitor load on the system (cpu, memory, etc.) and how the ingestion
>> speed changes over time
>> - monitor the ingestion speed (is there any degradation over time?)
>> - new ES config:new_ES_config.sh; new ingestion:
>> new_ES_ingest_threads.pl
>> - new Solr ingestion: new_Solr_ingest_threads.pl
>> - flush interval: 10s
>>
>>
>> Scenario 0 (1000 fields):
>>   ES:   12 secs -> 833 docs/sec; CPU: 30.24%; Heap: 1.08G; index size: 36M;
>>         iowait: 0.02%; time (secs) per 1k docs: 3 1 1 1 1 1 0 1 2 1
>>   Solr: 13 secs -> 769 docs/sec; CPU: 28.85%; Heap: 9.39G;
>>         time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
>> Scenario 1 (10k fields):
>>   ES:   29 secs -> 345 docs/sec; CPU: 40.83%; Heap: 5.74G; index size: 36M;
>>         iowait: 0.02%; time (secs) per 1k docs: 14 2 2 2 1 2 2 1 2 1
>>   Solr: 12 secs -> 833 docs/sec; CPU: 28.62%; Heap: 9.88G;
>>         time (secs) per 1k docs: 1 1 1 1 2 1 1 1 1 2
>> Scenario 2 (100k fields):
>>   ES:   17 mins 44 secs -> 9.4 docs/sec; CPU: 54.73%; Heap: 47.99G;
>>         index size: 75M; iowait: 0.02%;
>>         time (secs) per 1k docs: 97 183 196 147 109 89 87 49 66 40
>>   Solr: 13 secs -> 769 docs/sec; CPU: 29.43%; Heap: 9.84G;
>>         time (secs) per 1k docs: 2 1 1 1 1 1 1 1 2 2
>> Scenario 3 (1M fields):
>>   ES:   183 mins 8 secs -> 0.9 docs/sec; CPU: 40.47%; Heap: 47.99G;
>>         time (secs) per 1k docs: 133 422 701 958 989 1322 1622 1615 1630 1594
>>   Solr: 15 secs -> 666.7 docs/sec; CPU: 45.10%; Heap: 9.64G;
>>         time (secs) per 1k docs: 2 1 1 1 1 2 1 1 3 2
>>
>> Thanks!
>> Cindy
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/4efc9c2d-ead4-4702-896d-dc32b5867859%40googlegroups.com
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>