Hi, as you noticed I added a branch with performance metrics. I also reran the tests with similar and larger user numbers.
Results:
- *80 users test*: https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+80+users+test
- *140 users test*: https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test

Findings

*1 - Added index on address.email*

I actually ran the *80 users test* twice. I found that I could improve the performance of the login command by adding an index on address.email. See:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+80+users+test#OpenMeetings80userstest-CPUandmemoryusage

*2 - Login command performance gets increasingly bad with the number of users*

While I could improve UserService::login (or generally UserDao::login) somewhat with the index, *when I switched from 80 users to 140 users* and reran the tests, *the login command started to take 10+ seconds*.
- I also used an actual log statement to verify my metrics, because it just seemed so strange!
- I also checked other methods; they are NOT affected, or at most their durations increase very slightly. The application is NOT generally slow, but some part of it is!
- You can see how the duration increases during the test run here:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test#OpenMeetings140userstest-Loginwebservicecallhits10seconds
- And you can see here that it's not just the HTTP call that takes long, but the actual UserDao::login command:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test#OpenMeetings140userstest-DatabaseUserDaologinmethodisbyfarthelongestrunningoneandhitsplus10seconds

=> Now this is kind of puzzling. Other methods don't degrade like that; it's just this login command. RoomDao methods, for example, do not perform badly.
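For reference, the index change from finding 1 can be sketched as plain DDL (the table and column names here are assumed from the "address.email" field mentioned above; the actual branch may declare the index via JPA annotations instead):

```sql
-- Hypothetical sketch: speed up login's lookup of users by e-mail.
-- Table/column names assumed from "address.email" above; index name is made up.
CREATE INDEX idx_address_email ON address (email);
```

Without such an index, every login that resolves a user by e-mail falls back to a full table scan, which would explain why the login command benefits while other methods are unaffected.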
Or at least not at that scale (by far not).

*Questions:*
- Is it possible that the login method, which fetches the entire user, has too many entities linked, so that even a simple fetch or login quickly starts to take a long time?
- Is it really possible that only this method takes so long, or am I measuring a general issue because the CPU spikes at 95%? Like I say, RoomDao seems fine; are there other methods I should measure?

*3 - Tomcat threads unused or way below max*

As you can see between those two graphs:
- 80 users test active threads: https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+80+users+test#OpenMeetings80userstest-Tomcatactivethreads
- 140 users test active threads: https://cwiki.apache.org/confluence/display/OPENMEETINGS/OpenMeetings+140+users+test#OpenMeetings140userstest-Tomcatactivethreads

=> 140 users utilize more threads, yes. But it's WAY below the 400 available, not even close to that number. I don't think it's a threading issue. That also makes sense: given the test scenario, you probably have around 20-40 users hitting the server trying to log in and enter the room at the same time.

*4 - RoomPanel::onInitialise and RoomPanel::enterRoom*

I put some metrics on those methods; you can see the results of the 140 users test run here:
http://54.162.44.21:5080/graph?g0.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22application%22%2Cclass%3D%22RoomPanel%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22application%22%2Cclass%3D%22RoomPanel%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2002%3A52%3A00&g0.moment_input=2021-02-05%2002%3A52%3A00

=> It is definitely getting worse, BUT this is MILLISECONDS. It increased from 0.05 seconds to 0.3 seconds. That's getting slower, but not that bad.
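To make the numbers above easier to interpret: the metrics I put on those methods are of the usual Prometheus sum/count shape. A minimal pure-Java sketch of that idea (class and method names here are made up for illustration; the branch uses an actual metrics library):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a sum/count duration metric, the same shape the
// Prometheus metrics above expose: a running total of observed time
// plus the number of observations.
class DurationMetric {
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong count = new AtomicLong();

    // Time one invocation of the instrumented body.
    void record(Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            count.incrementAndGet();
        }
    }

    // Average duration per call in milliseconds; this is what the
    // dashboard's rate(..._sum[1m]) / rate(..._count[1m]) plots show.
    double averageMillis() {
        long n = count.get();
        return n == 0 ? 0.0 : totalNanos.get() / (double) n / 1_000_000.0;
    }

    long count() {
        return count.get();
    }
}

public class Main {
    public static void main(String[] args) {
        DurationMetric metric = new DurationMetric();
        metric.record(() -> { /* simulated instrumented method body */ });
        metric.record(() -> { /* second call */ });
        System.out.println("count=" + metric.count());
        System.out.println("avgMsNonNegative=" + (metric.averageMillis() >= 0.0));
    }
}
```

The same wrapper dropped around RoomPanel::onInitialise gives the millisecond-range numbers above, while around UserDao::login it is what shows the 10+ second durations.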
If you compare the same graph with the login command:
http://54.162.44.21:5080/graph?g0.expr=rate(webapp_metrics_filter_sum%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2Fuser%2F.%2B%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(webapp_metrics_filter_count%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2Fuser%2F.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2002%3A52%3A00&g0.moment_input=2021-02-05%2002%3A52%3A00

=> That goes from ~800ms to 12 seconds! The curves show some similarity, but the durations are very different!

*5 - Missing video pods massively increased from 80 to 140 users*

During the time when login is slow, the number of conference rooms with missing video pods massively increases. That could be because the OpenMeetings server is generally slow, or because something is particularly slow within fetching the user entity or the login command.

*6 - Missing video is always an issue on the sender side*

From my tests I could see that if a video is missing, it is because a SENDER has:
A) Not started the video
B) Got stuck while the video is being published

Like I say, case (A) MASSIVELY increased once the server had 140 users. But I think it's valuable to know that there seem to be TWO issues: one where video pods don't start up at all, and one where, once the video pod has started, the video stream somehow doesn't get triggered.

*Generally - digging through the above graphs and reports*

You can dig through all of the above metrics and test runs yourself. I will obviously delete those dashboards again shortly.
But you can have a dig through it:
- Dashboard for v5.1.0 with additional indexes on Address.email:
<http://54.162.44.21:5080/graph?g0.expr=rate(webapp_metrics_filter_sum%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(webapp_metrics_filter_count%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2000%3A30%3A00&g0.moment_input=2021-02-05%2000%3A30%3A00&g1.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22application%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22application%22%7D%5B1m%5D)&g1.tab=0&g1.stacked=0&g1.range_input=15m&g1.end_input=2021-02-05%2000%3A30%3A00&g1.moment_input=2021-02-05%2000%3A30%3A00&g2.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22database%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22database%22%7D%5B1m%5D)&g2.tab=0&g2.stacked=0&g2.range_input=15m&g2.end_input=2021-02-05%2000%3A30%3A00&g2.moment_input=2021-02-05%2000%3A30%3A00&g3.expr=tomcat_threads_active_total&g3.tab=0&g3.stacked=0&g3.range_input=15m&g3.end_input=2021-02-05%2000%3A30%3A00&g3.moment_input=2021-02-05%2000%3A30%3A00>
- Test with 140 users at 2021-02-05 02:52:00:
<http://54.162.44.21:5080/graph?g0.expr=rate(webapp_metrics_filter_sum%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(webapp_metrics_filter_count%7Bpath%3D~%22%2Fopenmeetings%2Fservices%2F.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=15m&g0.end_input=2021-02-05%2002%3A52%3A00&g0.moment_input=2021-02-05%2002%3A52%3A00&g1.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22application%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22application%22%7D%5B1m%5D)&g1.tab=0&g1.stacked=0&g1.range_input=15m&g1.end_input=2021-02-05%2002%3A52%3A00&g1.moment_input=2021-02-05%2002%3A52%3A00&g2.expr=rate(org_openmeetings_metrics_sum%7Btype%3D%22database%22%7D%5B1m%5D)%0A%2F%0A%20%20rate(org_openmeetings_metrics_count%7Btype%3D%22database%22%7D%5B1m%5D)&g2.tab=0&g2.stacked=0&g2.range_input=15m&g2.end_input=2021-02-05%2002%3A52%3A00&g2.moment_input=2021-02-05%2002%3A52%3A00&g3.expr=tomcat_threads_active_total&g3.tab=0&g3.stacked=0&g3.range_input=15m&g3.end_input=2021-02-05%2002%3A52%3A00&g3.moment_input=2021-02-05%2002%3A52%3A00>

If you go to those dashboards, it will probably help a lot to first have a look at:
https://cwiki.apache.org/confluence/display/OPENMEETINGS/Prometheus+Logging+and+Metrics

*What else?*
- What are we thinking about the above findings? Is the login method a performance issue, or just a symptom of a generally slow server? It doesn't really look like a general issue; the difference between UserDao::login and EVERYTHING else seems just too drastic.
- What other methods and calls should I try to trace and measure?
- Is there any event within initialising a video pod that I could measure and add a metric for?

Any other ideas?
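One reading aid for the dashboard links above: after URL-decoding, every g0.expr/g1.expr/... query uses the same PromQL pattern for average call duration, for example (metric name and label taken from the database graph above):

```
rate(org_openmeetings_metrics_sum{type="database"}[1m])
/
  rate(org_openmeetings_metrics_count{type="database"}[1m])
```

Dividing the per-second rate of total observed duration by the per-second rate of observation count gives the average duration per call over the last minute, which is why the login graphs read directly in seconds.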
Thanks
Sebastian

Sebastian Wagner
Director Arrakeen Solutions, OM-Hosting.com
http://arrakeen-solutions.co.nz/
https://om-hosting.com - Cloud & Server Hosting for HTML5 Video-Conferencing OpenMeetings
<https://www.youracclaim.com/badges/da4e8828-743d-4968-af6f-49033f10d60a/public_url>
<https://www.youracclaim.com/badges/b7e709c6-aa87-4b02-9faf-099038475e36/public_url>
