Hi,

as you noticed, I added a branch with performance metrics and reran the
tests with similar and larger user numbers.

Results

   - *80 users test*
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+80+users+test
   - *140 users test*
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test

Findings

*1 - Added index on address.email*

I actually ran the *80 users test* twice. I found that I could improve the
performance of the login command by adding an index on address.email.
See:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+80+users+test#OpenMeetings80userstest-CPUandmemoryusage
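In case it is useful for the discussion, this is roughly the shape of such an
index as a JPA annotation. The index name and exact mapping below are
illustrative assumptions, not necessarily what my branch does:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.Table;

// Sketch only: declaring an index on the e-mail column via JPA so the login
// lookup by e-mail address can use it. Index and mapping names here are
// illustrative assumptions, not necessarily what the branch actually does.
@Entity
@Table(name = "address", indexes = @Index(name = "idx_address_email", columnList = "email"))
public class Address {

	@Id
	private Long id;

	@Column(name = "email")
	private String email;

	// ... remaining fields omitted
}

(The same thing can of course also be done directly in SQL with a CREATE
INDEX on the address table.)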

*2 - Login command performance gets increasingly bad with the number of users*

While the index did improve UserService::login (or more generally
UserDao::login) somewhat, *I then switched from 80 users to 140 users*
and reran the tests: *the login command started to take 10+ seconds*.

   - I also used an actual log statement to verify my metrics, because it
   just seemed so strange! (A minimal sketch of that kind of cross-check is
   below, after the next paragraph.)
   - I also checked other methods; they are NOT affected, or at most they
   increase very slightly by comparison. It's NOT generally slow, but some
   part of the application is slow!
   - You can see how the duration increases during the test run here:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test#OpenMeetings140userstest-Loginwebservicecallhits10seconds
   - And you can see here that it's not just the HTTP call that takes long
   but the actual UserDao::login method:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test#OpenMeetings140userstest-DatabaseUserDaologinmethodisbyfarthelongestrunningoneandhitsplus10seconds

=> Now this is kind of puzzling. Other methods don't degrade like that with
the number of users; it's just this login command. RoomDao methods, for
example, do not perform badly, or at least not at that scale (not by far).
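For reference, the cross-check mentioned above is essentially just a crude
wall-clock timer around the suspect call, logged independently of Prometheus.
A minimal sketch of the idea (not the exact code I used; the UserDao::login
signature in the usage comment is assumed):

import java.util.function.Supplier;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: time any call with System.nanoTime() and log the result, so
// the Prometheus graphs can be verified against plain log output.
public final class TimingCheck {
	private static final Logger log = LoggerFactory.getLogger(TimingCheck.class);

	private TimingCheck() {
	}

	public static <T> T timed(String label, Supplier<T> call) {
		long start = System.nanoTime();
		try {
			return call.get();
		} finally {
			log.info("{} took {} ms", label, (System.nanoTime() - start) / 1_000_000);
		}
	}
}

// usage, with an assumed UserDao::login signature:
// User u = TimingCheck.timed("UserDao::login", () -> userDao.login(email, password));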

*Questions:*

   - Is it possible that the login method, which fetches the entire user,
   has too many linked entities, so that even a simple fetch or login quickly
   starts to take a long time? (See the sketch right after this list for the
   kind of mapping I mean.)
   - Is it really possible that only this one method takes so long, or am I
   measuring a general issue because the CPU spikes at 95%? Like I said,
   RoomDao seems fine; are there other methods I should measure?
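To make the first question more concrete: what I have in mind is the usual
eager-fetch pattern, where loading one user also loads all of its linked
collections, so a single lookup gets more expensive as the linked data grows.
This is purely an illustration of the pattern, NOT the actual OpenMeetings
mapping:

import java.util.List;

import javax.persistence.ElementCollection;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;

// Illustration only, not the real User entity: with eager collections like
// these, every login that fetches the user also fetches all linked data,
// so the cost of a "simple" fetch grows with the amount of that data.
@Entity
public class IllustrativeUser {

	@Id
	private Long id;

	@ElementCollection(fetch = FetchType.EAGER)
	private List<String> rights;

	@ElementCollection(fetch = FetchType.EAGER)
	private List<String> groupNames;
}

If the real mapping looks anything like this, switching some relations to
lazy fetching (or using a slimmer query just for login) would be one thing
to try.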

*3 - Tomcat threads unused or way below max*

As you can see by comparing these two graphs:

   - 80 users test active threads:
   
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+80+users+test#OpenMeetings80userstest-Tomcatactivethreads
   - 140 users test active threads:
   
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test#OpenMeetings140userstest-Tomcatactivethreads

=> 140 users utilize more threads, YES. But it's WAY below the 400
available, not even close to that number. I don't think it's a threading
issue. That also makes sense: given the test scenario, you probably have
around 20-40 users hitting the server trying to log in and enter the room
at the same time.
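If anyone wants to double-check the thread numbers independently of the
tomcat_threads_active_total graph, Tomcat's ThreadPool MBeans expose the same
figures over JMX. A small sketch (the "Catalina" domain and connector naming
assume a default standalone Tomcat; adjust as needed):

import java.lang.management.ManagementFactory;
import java.util.Set;

import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch only: read busy vs. configured max threads straight from Tomcat's
// ThreadPool MBeans, as an independent cross-check of the Prometheus graph.
// Run inside the same JVM, or connect to it via a remote JMX connector.
public class ThreadPoolCheck {

	public static void main(String[] args) throws Exception {
		MBeanServer server = ManagementFactory.getPlatformMBeanServer();
		Set<ObjectName> pools = server.queryNames(new ObjectName("Catalina:type=ThreadPool,name=*"), null);
		for (ObjectName pool : pools) {
			Object busy = server.getAttribute(pool, "currentThreadsBusy");
			Object max = server.getAttribute(pool, "maxThreads");
			System.out.println(pool.getKeyProperty("name") + ": busy=" + busy + ", max=" + max);
		}
	}
}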

*4 - RoomPanel:onInitialise and RoomPanel:enterRoom*

I put some metrics on those methods; you can see the results of the 140
users test run here:
http://54.162.44.21:5080/graph?g0.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22application%22%2Cclass%3D%22RoomPanel%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22application%22%2Cclass%3D%22RoomPanel%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2002%3A52%3A00&g0.moment_input=2021-02-05%2002%3A52%3A00
=> It is definitely getting worse, BUT this is MILLISECONDS. It increased
from 0.05 seconds to 0.3 seconds. That's getting slower, but not that bad.

If you compare the same graph with the login command:
http://54.162.44.21:5080/graph?g0.expr=rate(webapp_metrics_filter_sum%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2Fuser%2F.%2B%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(webapp_metrics_filter_count%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2Fuser%2F.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2002%3A52%3A00&g0.moment_input=2021-02-05%2002%3A52%3A00
=> That goes from ~800ms to 12 seconds!

The curves show some similarity, but the durations are very different!
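As a side note on how to read those two graphs: both plot
rate(..._sum[1m]) / rate(..._count[1m]), i.e. the average duration per call
over a one-minute window. The _sum/_count pairs come from timer-style
metrics around the methods; roughly this kind of pattern (just the generic
idea, shown here with Micrometer, not necessarily the code in the metrics
branch):

import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;

// Generic pattern only, not the code from the metrics branch: a timer around
// the method body produces the *_sum and *_count series whose ratio the
// graphs above plot as "average duration per call".
public class RoomPanelTimingExample {

	private static final Timer ENTER_ROOM_TIMER = Metrics.timer("org_openmeetings_metrics",
			"type", "application", "class", "RoomPanel", "method", "enterRoom");

	public void enterRoom(Runnable actualEnterRoom) {
		ENTER_ROOM_TIMER.record(actualEnterRoom);
	}
}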

*5 - Missing video pods massively increased from 80 to 140 users*

During the time when login is slow, the number of conference rooms where
video pods are missing increases massively.

That could be because the OpenMeetings server is generally slow, or because
something is particularly slow in fetching the user entity or in the login
command.

*6 - Missing video is always an issue on the sender side*

From my tests I could see that, if a video is missing, it is because a
SENDER has either:
A) not started the video, or
B) got stuck while the video was being published.

Like I said, case (A) MASSIVELY increased once the server had 140 users.
But I think it's valuable information to know that there seem to be TWO
issues: one where video pods don't start up at all, and the other where,
once the video pod has started, the video stream somehow doesn't get
triggered.


Generally - digging through the above graphs and reports

You can dig through all of the above metrics and test runs yourself. I will
obviously delete those dashboards again shortly, but in the meantime you can
have a look:

   - Dashboard for v5.1.0 with additional indexes on Address.email
   
<http://54.162.44.21:5080/graph?g0.expr=rate(webapp_metrics_filter_sum%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(webapp_metrics_filter_count%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2000%3A30%3A00&g0.moment_input=2021-02-05%2000%3A30%3A00&g1.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22application%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22application%22%7D%5B1m%5D)&g1.tab=0&g1.stacked=0&g1.range_input=15m&g1.end_input=2021-02-05%2000%3A30%3A00&g1.moment_input=2021-02-05%2000%3A30%3A00&g2.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22database%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22database%22%7D%5B1m%5D)&g2.tab=0&g2.stacked=0&g2.range_input=15m&g2.end_input=2021-02-05%2000%3A30%3A00&g2.moment_input=2021-02-05%2000%3A30%3A00&g3.expr=tomcat_threads_active_total&g3.tab=0&g3.stacked=0&g3.range_input=15m&g3.end_input=2021-02-05%2000%3A30%3A00&g3.moment_input=2021-02-05%2000%3A30%3A00>
   - Test with 140 users at 2021-02-05 02:52:00
   
<http://54.162.44.21:5080/graph?g0.expr=rate(webapp_metrics_filter_sum%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(webapp_metrics_filter_count%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2002%3A52%3A00&g0.moment_input=2021-02-05%2002%3A52%3A00&g1.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22application%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22application%22%7D%5B1m%5D)&g1.tab=0&g1.stacked=0&g1.range_input=15m&g1.end_input=2021-02-05%2002%3A52%3A00&g1.moment_input=2021-02-05%2002%3A52%3A00&g2.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22database%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22database%22%7D%5B1m%5D)&g2.tab=0&g2.stacked=0&g2.range_input=15m&g2.end_input=2021-02-05%2002%3A52%3A00&g2.moment_input=2021-02-05%2002%3A52%3A00&g3.expr=tomcat_threads_active_total&g3.tab=0&g3.stacked=0&g3.range_input=15m&g3.end_input=2021-02-05%2002%3A52%3A00&g3.moment_input=2021-02-05%2002%3A52%3A00>

If you go to those dashboards, it will probably help a lot to first have a
look at:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/Prometheus+Logging+and+Metrics

*What else?*


   - What do we think about the above findings? Is the login method a
   performance issue or just a symptom of a generally slow server? It doesn't
   really look like a general issue; the difference between UserDao::login and
   EVERYTHING else just seems too drastic.
   - What other methods and calls should I try to trace and measure?
   - Is there any event during video pod initialisation that I could
   measure and add a metric for?

Any other ideas?

Thanks
Sebastian

Sebastian Wagner
Director Arrakeen Solutions, OM-Hosting.com
http://arrakeen-solutions.co.nz/
https://om-hosting.com - Cloud & Server Hosting for HTML5
Video-Conferencing OpenMeetings
