Re: [oae-dev] SAKAI3_JAVA_OPTS show and tell

Lance Speelmon Tue, 21 Aug 2012 08:57:57 -0700

# file managed by puppet

# filled out via puppet templating
############################ \ this is not a typo / set min and max to javamemorymax
export JAVA_OPTS="-server -Xms<%= javamemorymax %> -Xmx<%= javamemorymax %> \
-XX:PermSize=<%= javapermsize %> -XX:MaxPermSize=<%= javapermsize %> \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:NewRatio=3 -XX:-UseAdaptiveSizePolicy \
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
-XX:MaxTenuringThreshold=0 -XX:-DisableExplicitGC \
-XX:+UseCMSInitiatingOccupancyOnly \
-Djava.awt.headless=true -verbose:gc \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
-XX:+PrintTenuringDistribution \
-XX:+PrintCommandLineFlags"


# http://randomlyrr.blogspot.it/2012/03/java-tuning-in-nutshell-part-1.html

# -Xmx should be equal to -Xms. Growing from Xms to Xmx requires Full GCs to
# resize the heap. Set these to the same value if Full GCs are to be completely
# eliminated in production.

# -XX:PermSize should be equal to -XX:MaxPermSize
# Both params need to be specified and should have the same value. Otherwise,
# a Full GC is required for each Perm Gen resize while it grows up to
# MaxPermSize.
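
To make the two rules above concrete, here is a minimal sketch that builds the sizing flags from one heap value and one perm gen value, so that min and max cannot differ. The helper name `sizing_opts` and the example sizes (4g, 256m) are assumptions for illustration only; they are not values from the template.

```shell
# sizing_opts HEAP PERM: print heap and perm gen flags with min == max.
# (Hypothetical helper; the sizes passed in are your own choice.)
sizing_opts() {
  printf '%s\n' "-Xms$1 -Xmx$1 -XX:PermSize=$2 -XX:MaxPermSize=$2"
}

# Example call with assumed sizes.
sizing_opts 4g 256m
```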

# -XX:NewSize is specified but not equal to -XX:MaxNewSize
# Like the other heap params, resize of new/young gen requires a Full GC. The
# preferred approach is to avoid these two parameters and use -Xmn instead.
# This eliminates the problem as setting, say "-Xmn1g", is the same as setting
# "-XX:NewSize=1g -XX:MaxNewSize=1g".
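
One rough way to spot the mismatch described above is to scan an options string for the two flags. This is only a sketch: `check_young_gen` is a hypothetical helper, and it assumes the options are separated by spaces and each flag appears at most once.

```shell
# check_young_gen OPTS: report when NewSize and MaxNewSize are both set but differ.
check_young_gen() {
  # Put one option per line, then pull out the value after each flag name.
  new=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^-XX:NewSize=//p')
  max=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^-XX:MaxNewSize=//p')
  if [ -n "$new" ] && [ -n "$max" ] && [ "$new" != "$max" ]; then
    echo "mismatch: NewSize=$new MaxNewSize=$max (consider -Xmn instead)"
    return 1
  fi
  echo "ok"
}

# Example with assumed values: the two sizes differ, so a mismatch is reported.
check_young_gen "-Xmx4g -XX:NewSize=512m -XX:MaxNewSize=1g" || true
```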

# Although UseConcMarkSweepGC is specified, CMS can and often will kick in too
# late, causing a Full GC when it can't catch up. In other words, although CMS
# is collecting garbage, the application threads that are executing concurrently
# run out of heap for allocation because CMS couldn't free garbage soon enough.
# At this point, the JVM stops all application threads and does a Full GC.
# This is also called a "concurrent mode failure" in GC logs. The reason for
# concurrent mode failure is that the JVM dynamically picks the point at which
# CMS should be initiated and adjusts it based on statistics. However, in
# production, load is often bursty, which leads to misses/miscalculation for the
# last dynamically computed initiation value. To prevent this, provide a static
# value for CMS initiation. Use -XX:CMSInitiatingOccupancyFraction (a percentage
# of old generation occupancy) to tell the JVM at what point it should initiate
# CMS. A value between 40 and 70 usually works for most Fusion middleware
# products. Start with the higher value (70) and tune down only if you still see
# the string "concurrent mode failure" in the GC log.

# Secondly, always specify -XX:+UseCMSInitiatingOccupancyOnly when
# CMSInitiatingOccupancyFraction is used, otherwise the value you specify
# does not stick (JVM will dynamically change it on the fly again). This is
# very important and commonly missed.
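
The pairing of the two flags, and the check for the failure string, can be sketched as follows. Both helper names (`cms_opts`, `count_cmf`) are hypothetical, and the log file you pass to `count_cmf` is whatever file your GC output ends up in; nothing here is taken from the template beyond the flag names.

```shell
# cms_opts PERCENT: print the CMS flags with a fixed initiating occupancy.
# The fraction and UseCMSInitiatingOccupancyOnly are always emitted together.
cms_opts() {
  printf '%s\n' "-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=$1 -XX:+UseCMSInitiatingOccupancyOnly"
}

# count_cmf FILE: count GC log lines that mention a concurrent mode failure.
count_cmf() {
  grep -c "concurrent mode failure" "$1"
}

# Start at 70, as suggested above.
cms_opts 70
```

If `count_cmf` still reports a non-zero count after a load test, that would be the signal to try a lower percentage.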

# -XX:+UseCompressedOops: Highly recommended on 64-bit JVMs with an Xmx value
# less than 32g. However, this is available only on JDK6 update 14+.
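
A small sketch of applying the 32g rule when assembling options: the flag is emitted only when the configured heap is below 32g. The helper names `heap_mb` and `oops_opt` are hypothetical, and the sketch assumes sizes are written with an `m` or `g` suffix only.

```shell
# heap_mb SIZE: convert a JVM size such as 1500m or 4g to whole megabytes.
heap_mb() {
  case "$1" in
    *[gG]) echo $(( ${1%[gG]} * 1024 )) ;;
    *[mM]) echo "${1%[mM]}" ;;
    *) echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# oops_opt SIZE: print -XX:+UseCompressedOops when the heap is below 32g,
# and print nothing otherwise.
oops_opt() {
  mb=$(heap_mb "$1") || return 1
  if [ "$mb" -lt $(( 32 * 1024 )) ]; then
    printf '%s\n' "-XX:+UseCompressedOops"
  fi
}

# Example with an assumed 4g heap.
oops_opt 4g
```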


On Aug 20, 2012, at 7:40 PM, "Walters, Beren" <bwalt...@csu.edu.au> wrote:

> What sort of SAKAI3_JAVA_OPTS does everyone else use?
> 
> What sort of environment is this in (physical memory, local solr servers, 
> etc)?
> 
> Is anyone using the concurrent collector to try and reduce application pauses?
> 
> We are running the options shown below on two (virtual) app servers, each 
> with 4GB of memory, with separate solr and database servers, no garbage 
> collector specified.
> 
> Thanks,
> Beren
> 
> -----Original Message-----
> From: Branden Visser [mailto:mrvis...@gmail.com]
> Sent: Monday, 20 August 2012 8:40 PM
> To: Walters, Beren
> Cc: oae-dev@collab.sakaiproject.org
> Subject: Re: [oae-dev] OAE-model-loader
> 
> Thanks for the graphs, Beren. Given the spike in GC activity around
> the times that the loading failed, there is some substance to the
> theory that the JVM was struggling with memory.
> 
> Cheers,
> Branden
> 
> On Sun, Aug 19, 2012 at 11:07 PM, Walters, Beren <bwalt...@csu.edu.au> wrote:
>> Hi Branden,
>> 
>> I was just looking at the total server memory using the free command. Not 
>> very useful in retrospect.
>> 
>> We are currently running using the following java options:
>> SAKAI3_JAVA_OPTS="-Xmx1500m -XX:MaxPermSize=256m -server 
>> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=****** 
>> -Dcom.sun.management.jmxremote.ssl=false 
>> -Dcom.sun.management.jmxremote.password.file=****** -Djava.security.manager 
>> -Djava.security.policy=****** -Djava.awt.headless=true -Dhttp.proxySet=true 
>> -Dhttp.proxyHost=****** -Dhttp.proxyPort=****** 
>> -Dhttp.nonProxyHosts='******' -Dhttp.proxyUser=****** 
>> -Dhttp.proxyPassword=******* -Dcom.sun.management.snmp.port=****** 
>> -Dcom.sun.management.snmp.acl.file=******"
>> 
>> So I guess the server only had 1.5GB allocated for the JVM itself.
>> 
>> We had some monitoring running while I ran the two tests this morning:
>> 
>> https://oae-community.sakaiproject.org/content#p=mTbbqA15C/PS-marksweep.png
>> 
>> https://oae-community.sakaiproject.org/content#p=mTbbjAU7aa/PS-scavenge.png
>> 
>> The first run ran until about 10:45am and the second till about 11:30am.
>> 
>> I'm going to have to run the JVM changes through approval (even if it is 
>> only temporary) so it may take me a while to produce the log you are after.
>> 
>> Unfortunately I'm not able to run the model loader on the app server itself.
>> 
>> Thanks for the package.js info.
>> 
>> Cheers,
>> Beren.
>> 
>> -----Original Message-----
>> From: Branden Visser [mailto:mrvis...@gmail.com]
>> Sent: Monday, 20 August 2012 12:21 PM
>> To: Walters, Beren
>> Cc: oae-dev@collab.sakaiproject.org
>> Subject: Re: [oae-dev] OAE-model-loader
>> 
>> Hi Beren,
>> 
>> When you say that the app server has 4GB of memory, do you mean the
>> JVM is configured with 4GB in the startup params (e.g., -Xmx)? The JVM
>> itself may have less allocated space. To see if you're running into
>> significant garbage-collection issues, try and enable the verbose
>> garbage collector logs on the app server. If they're already enabled,
>> then its output would be useful if you could attach it.
>> 
>> Here are some relevant parameters:
>> 
>> java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>> -Xloggc:<path/to/output/file.log> ...
>> 
>> The garbage collector is capable of completely locking up the JVM to
>> clean out de-referenced objects. If it locks for a significant amount
>> of time, timeouts may occur. As you've found with the
>> OAE-model-loader, it only takes one timeout to crash it.
>> 
>> Another possibility is intermittent network connectivity? Or maybe a
>> firewall / IDP causing issues? If possible, you could try running the
>> OAE-model-loader from the same machine as the app server. This can
>> actually greatly improve the loading time as well if there is a
>> relatively larger cost establishing an HTTP connection over the
>> network.
>> 
>> Also, I forgot to reply about the lack of the package.js script in the
>> OAE-model-loader. The PR [1] that adds this functionality is actually
>> still outstanding, I guess I got a little excited with the
>> documentation. To get a head start, you could work from my
>> performance-testing branch for now [2]. Everything about generating
>> and loading the data is still the same, though. So no need to redo
>> anything by taking in those changes.
>> 
>> Hope that helps,
>> Branden
>> 
>> [1] https://github.com/sakaiproject/OAE-model-loader/pull/28
>> [2] https://github.com/mrvisser/OAE-model-loader/tree/performance-testing
>> 
>> On Sun, Aug 19, 2012 at 9:35 PM, Walters, Beren <bwalt...@csu.edu.au> wrote:
>>> 
>>> Hi Branden,
>>> 
>>> 
>>> 
>>> The app server has 4GB of memory (and 6GB of swap) and does not run the 
>>> solr or database services. I did have to bump the memory to 1.5GB on the 
>>> Debian VM I use for running OAE-model-loader as it kept being killed by the 
>>> kernel's OOM process killer during the generate phase.
>>> 
>>> 
>>> 
>>> I'm running the import across a 100mb wired network which never seems to 
>>> exceed about 2-3% utilisation, generally more like 0.5-1%.
>>> 
>>> 
>>> 
>>> Run 1: Using the previously run data.
>>> 
>>> =====
>>> 
>>> 
>>> 
>>> At start of run (100 batches of 500 users, 2 worlds, 2 content, 2 
>>> collections) the server is using 1.6GB of memory.
>>> 
>>> 
>>> 
>>> After 180 users of the first batch it has hit 2.2GB of used memory.
>>> 
>>> 
>>> 
>>> After 321 of the first 500 user batch it has reached 2.25GB.
>>> 
>>> 
>>> 
>>> After 481 of the first 500 it has reached 2.28GB.
>>> 
>>> 
>>> 
>>> It appears to have finished the users in that batch at this point then died 
>>> (ECONNREFUSED) while loading Contact 244 of 6879.
>>> 
>>> 
>>> 
>>> I was able to keep using the app server during and after this load test - 
>>> browsing content etc.
>>> 
>>> 
>>> 
>>> I ran packet captures on the VM where I run OAE-model-loader when I first 
>>> hit this issue and it appeared that the loader was trying to create a TCP 
>>> connection for the HTTP transaction but never received a response before 
>>> hitting some timeout.
>>> 
>>> 
>>> 
>>> Attached is the log from this load test (import-oaeappdev01-logs2.zip). I 
>>> use 2>&1 when running the test so error out may be mixed with the standard 
>>> out and this file is 12+MB as it contains all of the errors for failing to 
>>> insert existing users and contacts.
>>> 
>>> 
>>> 
>>> Run 2: Reran the generate.js script using the same settings before this 
>>> test.
>>> 
>>> 
>>> 
>>> At start of run (100 batches of 500 users, 2 worlds, 2 content, 2 
>>> collections) the server is using 2.32GB of memory. I assume this is due to 
>>> the Linux virtual memory strategy, it won't release the memory until it 
>>> reaches the cache pressure threshold.
>>> 
>>> 
>>> 
>>> After 141 users of first batch it is at 2.36GB.
>>> 
>>> 
>>> 
>>> At user 189 I saw an error in the log: Could not create user 
>>> batch0-lory-turnell-217 because No live SolrServers available to handle 
>>> this request
>>> 
>>> This didn't abort the load as per the ECONNREFUSED error.
>>> 
>>> 
>>> 
>>> After 250 users of first batch it is at 2.39GB.
>>> 
>>> 
>>> 
>>> After 360 users of the first batch it is at 2.43GB.
>>> 
>>> 
>>> 
>>> At user 460 I received the solrserver error again.
>>> 
>>> 
>>> 
>>> It then died (ECONNREFUSED) while loading contact 360 of 4954.
>>> 
>>> 
>>> 
>>> The app server kept running fine during and after this load.
>>> 
>>> 
>>> 
>>> Attached is the log from this load test (import-oaeappdev01-logs3.zip)
>>> 
>>> 
>>> 
>>> Let me know if there are any more details I can provide. Perhaps monitoring 
>>> the solr servers (we run a master + slave config)?
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Beren.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> From: Branden Visser [mailto:mrvis...@gmail.com]
>>> Sent: Monday, 20 August 2012 10:19 AM
>>> To: Walters, Beren
>>> Subject: Re: [oae-dev] OAE-model-loader
>>> 
>>> 
>>> 
>>> Hi Beren, when that happens, is the app server responsive at all?
>>> 
>>> You may be running out of memory. I managed to load 5000 users on my 
>>> MacBook by feeding the server 3gb of memory. I was running solr embedded 
>>> with postgres on the MacBook as well.
>>> 
>>> Hope that helps,
>>> Branden
>>> 
>>> On Aug 19, 2012 7:08 PM, "Walters, Beren" <bwalt...@csu.edu.au> wrote:
>>> 
>>> Hi All,
>>> 
>>> 
>>> 
>>> I'm having a bit of trouble with the OAE-model-loader.
>>> 
>>> 
>>> 
>>> I can't add more than about 1000 users without the load failing with this 
>>> error:
>>> 
>>> 
>>> 
>>> events.js:66
>>>        throw arguments[1]; // Unhandled 'error' event
>>>                       ^
>>> Error: connect ECONNREFUSED
>>>    at errnoException (net.js:782:11)
>>>    at Object.afterConnect [as oncomplete] (net.js:773:19)
>>> 
>>> 
>>> 
>>> I have tried regenerating the data files in case there was an error but it 
>>> happens on every load.
>>> 
>>> 
>>> 
>>> I have already split the load into <500 per batch and run non-concurrent 
>>> loads but this problem persists.
>>> 
>>> 
>>> 
>>> Even loading the existing users again (receiving item already exists HTTP 
>>> responses) the load will still fail, often before it gets to the users that 
>>> have not yet been loaded.
>>> 
>>> 
>>> 
>>> Does anyone have any ideas on how to solve this issue? Could the error be 
>>> caught? The server doesn't seem to be heavily loaded during the process and 
>>> continues working after the failure with nothing obvious in the logs.
>>> 
>>> 
>>> 
>>> I also haven't been able to find the package.js script referred to at 
>>> https://confluence.sakaiproject.org/display/3AK/Performance+Testing+Methodology#PerformanceTestingMethodology-LoadingSourceDataintoOAE
>>> Could someone point me in the direction of it?
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Beren.
>>> 
>>> 
>>> 
>>> |   ALBURY-WODONGA   |   BATHURST   |   CANBERRA   |   DUBBO   |   GOULBURN 
>>>   |   MELBOURNE   |   ONTARIO   |   ORANGE   |   PORT MACQUARIE   |   
>>> SYDNEY   |   WAGGA WAGGA   |
>>> 
>>> ________________________________
>>> 
>>> LEGAL NOTICE
>>> This email (and any attachment) is confidential and is intended for the use 
>>> of the addressee(s) only. If you are not the intended recipient of this 
>>> email, you must not copy, distribute, take any action in reliance on it or 
>>> disclose it to anyone. Any confidentiality is not waived or lost by reason 
>>> of mistaken delivery. Email should be checked for viruses and defects 
>>> before opening. Charles Sturt University (CSU) does not accept liability 
>>> for viruses or any consequence which arise as a result of this email 
>>> transmission. Email communications with CSU may be subject to automated 
>>> email filtering, which could result in the delay or deletion of a 
>>> legitimate email before it is read at CSU. The views expressed in this 
>>> email are not necessarily those of CSU.
>>> 
>>> Charles Sturt University in Australia The Grange Chancellery, Panorama 
>>> Avenue, Bathurst NSW Australia 2795 (ABN: 83 878 708 551; CRICOS Provider 
>>> Numbers: 00005F (NSW), 01947G (VIC), 02960B (ACT)). TEQSA Provider Number: 
>>> PV12018
>>> Charles Sturt University in Ontario 860 Harrington Court, Burlington 
>>> Ontario Canada L7N 3N4 Registration: www.peqab.ca
>>> 
>>> Consider the environment before printing this email.
>>> 
>>> 
>>> _______________________________________________
>>> oae-dev mailing list
>>> oae-dev@collab.sakaiproject.org
>>> http://collab.sakaiproject.org/mailman/listinfo/oae-dev

