[
https://issues.apache.org/jira/browse/ATLAS-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hemanth Yamijala updated ATLAS-616:
-----------------------------------
Attachment: heap.png
An update:
As described above, all indications pointed to the soft references holding
onto the GremlinGroovy script bindings as the cause of the problem. From what
I could see in the code, the version of the library we are using exposes no
knobs to adjust or tune this behaviour.
As a next step, I tried to see whether GC settings could be tuned to accomplish
this, and ran across http://stackoverflow.com/a/604395, which pointed to the GC
option {{-XX:SoftRefLRUPolicyMSPerMB=<value>}}. Likewise, the Sun JDK
documentation
(http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/java.html) says:
bq. -XX:SoftRefLRUPolicyMSPerMB=0 This flag enables aggressive processing of
soft references. Use this flag if the soft reference count has an impact on the
Java HotSpot VM garbage collector.
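As a rough standalone illustration of what this flag governs (a sketch, not Atlas code): by default the collector keeps a soft reference alive for roughly {{SoftRefLRUPolicyMSPerMB}} milliseconds per MB of free heap since its last access, while a value of 0 lets soft references be cleared at the first GC opportunity.

```java
import java.lang.ref.SoftReference;

public class SoftRefDemo {
    public static void main(String[] args) {
        byte[] payload = new byte[1024 * 1024]; // 1 MB referent
        SoftReference<byte[]> ref = new SoftReference<>(payload);

        // While a strong reference to the referent exists, get() cannot
        // return null.
        System.out.println("before drop: " + (ref.get() != null));

        payload = null; // drop the strong reference

        // Now only the soft reference keeps the array reachable. Whether it
        // survives a collection depends on memory pressure and on
        // -XX:SoftRefLRUPolicyMSPerMB: with the default (1000 ms/MB of free
        // heap) it typically survives; with 0 it is cleared eagerly.
        System.gc();
        System.out.println("cleared after gc: " + (ref.get() == null));
    }
}
```

Running this once with the default policy and once with {{-XX:SoftRefLRUPolicyMSPerMB=0}} makes the difference in clearing behaviour visible.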
Given the above hints, I ran tests with this setting at both 0 and 100. In
both cases GC performance improved dramatically, and I was able to increase
the number of queries while keeping performance linear. [~ssainath] helped me
run these tests in a server environment (still with JDK 7) and got similar
results. The attached graph is from a server environment running a total of
3600 queries; we tested up to 7200 queries as well. Each run scaled linearly
with time, and the logs showed no concurrency issues. The GC pattern is
stable, as can be seen above.
We are also going to test on OpenJDK 8 to see what the impact is, and if
things go well, I can put up a patch that simply recommends these settings
on the server for such loads.
For reference, the GC settings I use are:
{code}
export ATLAS_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:MaxNewSize=3072m
-XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=512m
-Djava.net.preferIPv4Stack=true -Xmx10240m -Xms10240m
-XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dumps/atlas_server.hprof -XX:PermSize=100M
-Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails
-XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dlog4j.configuration=atlas-log4j.xml"
{code}
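For anyone wanting to confirm at runtime that these options actually reached the JVM (a small standalone sketch, not part of Atlas), the platform {{RuntimeMXBean}} exposes the input arguments the JVM was started with:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmArgsCheck {
    public static void main(String[] args) {
        // Input arguments as seen by the running JVM; ATLAS_OPTS entries such
        // as -XX:SoftRefLRUPolicyMSPerMB=0 should appear here if they were
        // passed through correctly.
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String a : jvmArgs) {
            if (a.startsWith("-XX:") || a.startsWith("-Xm")) {
                System.out.println(a);
            }
        }
    }
}
```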
In addition to this effort, I also plan to write to the Tinkerpop mailing list
to ask whether they have any suggestions for tuning this, or for fixing it in
code.
> Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large
> scale.
> -------------------------------------------------------------------------------------
>
> Key: ATLAS-616
> URL: https://issues.apache.org/jira/browse/ATLAS-616
> Project: Atlas
> Issue Type: Bug
> Environment: Atlas with External kafka / HBase / Solr
> The test is run on cluster setup.
> Machine 1 - Atlas , Solr
> Machine 2 - Kafka , HBase
> Machine 3 - Hive , client
> Reporter: Sharmadha Sainath
> Assignee: Hemanth Yamijala
> Attachments: baseline-1000-3600-10g-heap.png, heap.png,
> no-dsl-1000-14400-10g-heap.png, zk-exception-stacktrace.rtf
>
>
> The test plan simulates 'n' users firing 'm' queries at Atlas
> simultaneously, with the help of Apache JMeter.
> Atlas is populated with 10,000 tables:
> • 6000 small tables (10 columns)
> • 3000 medium tables (50 columns)
> • 1000 large tables (100 columns)
> The test plan consists of 30 users firing a set of 3 queries continuously,
> 20 times in a loop. Added -Xmx10240m -XX:MaxPermSize=512m to ATLAS_OPTS.
> Zookeeper throws exceptions when the test plan is run and JMeter starts
> firing queries.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)