Re: Bulk Indexing Problems

Joshua P Tue, 09 Sep 2014 10:28:50 -0700

Hi Jörg, 

Can you elaborate on what you mean by "you need some more fine tuning"?


I've upped the heap size to 4g (in both places I mentioned before because 
it's not clear to me which one ES actually uses). I haven't tried to index 
again yet. 
Other than throttling my indexing, what are some other things I need to be 
thinking about? 
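
For reference, this is roughly how I understand the batching approach suggested further down the thread. It is only a sketch: the helper names (batches, build_bulk_body, total_rejected) and the index/type names in the usage note are made up for illustration, and I am assuming the stats JSON has the nodes -> thread_pool -> bulk -> rejected layout that /_nodes/stats/thread_pool returns on 1.x.

```python
import json
from itertools import islice


def batches(records, size=10000):
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_bulk_body(index, doc_type, docs):
    """Build the newline-delimited body for one _bulk request.

    `docs` is an iterable of (id, source_dict) pairs. Each document becomes
    an action line followed by its source line; the body ends with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


def total_rejected(nodes_stats, pool="bulk"):
    """Sum the `rejected` counter of one thread pool across all nodes.

    `nodes_stats` is the parsed JSON of a nodes stats response
    (assumed layout, see above).
    """
    return sum(
        node.get("thread_pool", {}).get(pool, {}).get("rejected", 0)
        for node in nodes_stats.get("nodes", {}).values()
    )
```

The idea would be to loop over batches(rows), POST build_bulk_body("transactions", "record", batch) to http://localhost:9200/_bulk, wait for the response, and only send the next batch if total_rejected(...) has not gone up since the previous check.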

On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:
>
> Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and 
> indexing around 1 million docs, you need some more fine tuning, which is 
> complicated. Your machine is OK with the heap set to 4 GB, which is 50% of 
> 8 GB RAM.
>
> Jörg
>
> On Tue, Sep 9, 2014 at 5:39 PM, Joshua P wrote:
>
>> Here is /etc/default/elasticsearch
>>
>> # Run Elasticsearch as this user ID and group ID
>> #ES_USER=elasticsearch
>> #ES_GROUP=elasticsearch
>>
>> # Heap Size (defaults to 256m min, 1g max)
>> ES_HEAP_SIZE=512m
>>
>> # Heap new generation
>> #ES_HEAP_NEWSIZE=
>>
>> # max direct memory
>> #ES_DIRECT_SIZE=
>>
>> # Maximum number of open files, defaults to 65535.
>> MAX_OPEN_FILES=65535
>>
>> # Maximum locked memory size. Set to "unlimited" if you use the
>> # bootstrap.mlockall option in elasticsearch.yml. You must also set
>> # ES_HEAP_SIZE.
>> MAX_LOCKED_MEMORY=unlimited
>>
>> # Maximum number of VMA (Virtual Memory Areas) a process can own
>> #MAX_MAP_COUNT=262144
>>
>> # Elasticsearch log directory
>> #LOG_DIR=/var/log/elasticsearch
>>
>> # Elasticsearch data directory
>> #DATA_DIR=/var/lib/elasticsearch
>>
>> # Elasticsearch work directory
>> #WORK_DIR=/tmp/elasticsearch
>>
>> # Elasticsearch configuration directory
>> #CONF_DIR=/etc/elasticsearch
>>
>> # Elasticsearch configuration file (elasticsearch.yml)
>> #CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>
>> # Additional Java OPTS
>> #ES_JAVA_OPTS=
>>
>> # Configure restart on package upgrade (true, every other setting will
>> # lead to not restarting)
>> #RESTART_ON_UPGRADE=true
>>
>> I also see the same setting in /etc/init.d/elasticsearch. Do you know 
>> which file takes priority? And what a good size would be? 
>>
>> On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:
>>>
>>> Hello Joshua,
>>>
>>> I am not sure which variable you are referring to in the memory settings 
>>> in the config file. Please paste the comment and config.
>>> I usually change the config in the init.d script.
>>>
>>> The best approach would be to bulk index, say, 10,000 feeds in sync mode, 
>>> wait until everything is indexed, and then proceed to the next batch.
>>> I am not sure about the Java API, but a long time ago I used to curl the 
>>> stats API to see how many requests were rejected.
>>>
>>> Thanks
>>>           Vineeth
>>>
>>> On Tue, Sep 9, 2014 at 8:58 PM, Joshua P wrote:
>>>
>>>> You also said you wouldn't recommend indexing that much information at 
>>>> once. How would you suggest breaking it up, and what status should I look 
>>>> for before doing another batch? I have to come up with some process that 
>>>> is repeatable and mostly automated. 
>>>>
>>>> On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:
>>>>>
>>>>> Thanks for the reply, Vineeth! 
>>>>>
>>>>> What's a practical heap size? I've seen some people saying they set it 
>>>>> to 30gb but this confuses me because in the /etc/default/elasticsearch 
>>>>> file, the comment suggests the max is only 1gb? 
>>>>>
>>>>> I'll look into the threadpool issue. Is there a Java API for 
>>>>> monitoring Cluster Node health? Can you point me at an example or give me 
>>>>> a link to that? 
>>>>>
>>>>> Thanks! 
>>>>>
>>>>> On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:
>>>>>>
>>>>>> Hello Joshua,
>>>>>>
>>>>>> I have a feeling this has something to do with the threadpool.
>>>>>> There is a limit on the number of feeds that can be queued for indexing.
>>>>>>
>>>>>> Try increasing the queue size of the index and bulk threadpools to a 
>>>>>> large number.
>>>>>> Also, through the cluster nodes stats API for the threadpool, you can 
>>>>>> see whether any requests have failed.
>>>>>> Monitor this API for requests that fail due to the large volume.
>>>>>>
>>>>>> Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
>>>>>> Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
>>>>>>
>>>>>> Having said that, I wouldn't recommend bulk indexing that much 
>>>>>> information at a time, and a 512 MB heap is not going to help much.
>>>>>>
>>>>>> Thanks
>>>>>>           Vineeth
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 7:48 PM, Joshua P wrote:
>>>>>>
>>>>>>> Hi there! 
>>>>>>>
>>>>>>> I'm trying to do a one-time index of about 800,000 records into an 
>>>>>>> instance of Elasticsearch, but I'm having a bit of trouble. It 
>>>>>>> consistently fails at around 200,000 records. Looking at it in the 
>>>>>>> Elasticsearch Head plugin, my index goes offline and becomes 
>>>>>>> unrecoverable. 
>>>>>>>
>>>>>>> For now, I have it running on a VM on my personal machine. 
>>>>>>>
>>>>>>> VM Config: 
>>>>>>> Ubuntu Server 14.04 64-Bit
>>>>>>> 8 GB RAM
>>>>>>> 2 Processors
>>>>>>> 32 GB SSD
>>>>>>>
>>>>>>> Java
>>>>>>> java version "1.7.0_65"
>>>>>>> OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
>>>>>>> OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
>>>>>>>
>>>>>>> Elasticsearch is using mostly the defaults. This is the output of: 
>>>>>>> curl http://localhost:9200/_nodes/process?pretty
>>>>>>> {
>>>>>>>   "cluster_name" : "property_transaction_data",
>>>>>>>   "nodes" : {
>>>>>>>     "KlFkO_qgSOKmV_jjj5xeVw" : {
>>>>>>>       "name" : "Marvin Flumm",
>>>>>>>       "transport_address" : "inet[/192.168.133.131:9300]",
>>>>>>>       "host" : "ubuntu-es",
>>>>>>>       "ip" : "127.0.1.1",
>>>>>>>       "version" : "1.3.2",
>>>>>>>       "build" : "dee175d",
>>>>>>>       "http_address" : "inet[/192.168.133.131:9200]",
>>>>>>>       "process" : {
>>>>>>>         "refresh_interval_in_millis" : 1000,
>>>>>>>         "id" : 1092,
>>>>>>>         "max_file_descriptors" : 65535,
>>>>>>>         "mlockall" : true
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> I adjusted ES_HEAP_SIZE to 512mb. 
>>>>>>>
>>>>>>> I'm using the following code to pull data from SQL Server and index it. 
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/f94f96d4-8c3f-462f-bdcf-df717cbc6269%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>