The socketserver is now completely reliant on Redis, using Redis' pub
/ sub functionality: http://redis.io/topics/pubsub
The reason for this is that I was using the websocket server to handle
all websocket functionality for the site I'm being paid to work on,
and it started running into problems as the site grew. The first issue
was an easy fix after Alex pointed me to it: increasing the number of
file descriptors in src64/sys/x86-64.linux.defs.l. My line #115 now
looks like this: (equ FD_SET 1024) # 1024 bit
After recompiling I could easily handle more than 500 clients and all
was well for a while.
Unfortunately the site is growing so fast that just a month or so
later the parent / root process started intermittently running at 100%
CPU utilization, and the service stopped working for perhaps 10-20
minutes before recovering on its own. At this point peak usage involved
2000 clients being connected at the same time.
Alex suspects that the issue has to do with how the internal logic
handles new processes being created when there are already a lot of
them present. In a normal HTTP server scenario this probably never
happens. Imagine that every request takes on average 1 second to
perform before the socket closes: you would then need about 2000
requests per second in order to trigger the CPU problem, and you'll
run into many other issues long before that happens in a non-trivial
scenario (trust me, I've tested).
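The estimate above is an instance of Little's law (average number of
open connections = arrival rate x average time each one stays open).
A minimal Python sketch of that arithmetic follows; the function names
are made up for illustration and are not part of pl-web.

```python
def required_request_rate(concurrent: int, avg_seconds: float) -> float:
    """Requests per second needed to keep `concurrent` sockets open
    when each request holds its socket for `avg_seconds` on average.

    Little's law: concurrent = rate * avg_seconds.
    """
    if avg_seconds <= 0:
        raise ValueError("avg_seconds must be positive")
    return concurrent / avg_seconds


def concurrent_connections(rate_per_second: float, avg_seconds: float) -> float:
    """Average number of sockets open at once for a given request rate."""
    return rate_per_second * avg_seconds


if __name__ == "__main__":
    # The HTTP scenario from the text: 2000 open sockets, 1 second each.
    print(required_request_rate(2000, 1.0))
```

A websocket stays open for minutes or hours rather than a second, which
is why the same 2000 concurrent processes are reached at a tiny
connection rate.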
In the end we switched over to a node.js based solution that also
relies on Redis' pub / sub functionality (that's where I got the idea
from to make the PL based solution also use it).
I have tried to replicate the real-world situation in terms of load
and number of clients, but have not been able to trigger the CPU issue
(which also seems to imply that Alex's suspicion is not completely on
target). It's impossible for me to fully replicate the real-world
situation since I can't commandeer hundreds of machines all over the
world to connect to my test server. What I did manage to trigger,
though, was fairly high CPU usage in the child processes, a situation
that also involved loss of service. After the switch to using pub / sub
I haven't been able to trigger it, so that's a win at least.
Now for the real improvement: making HTTP requests to publish
something becomes redundant when publishing from server to client,
since it's just a matter of issuing a publish call directly to Redis
instead. That lowers the amount of process creation by more than 90%
in my use case.
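To illustrate what "issuing a publish call directly to Redis" amounts
to on the wire, here is a small Python sketch using only the standard
library. It encodes a PUBLISH command in the Redis protocol (RESP) and
sends it over a plain TCP socket. The host, port, channel name and
function names are assumptions for the example; this is not how pl-web
itself does it.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a Redis command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def publish(channel: str, message: str,
            host: str = "127.0.0.1", port: int = 6379) -> int:
    """Send PUBLISH to a Redis server (assumed to be listening on
    host:port) and return how many subscribers received the message."""
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(encode_command("PUBLISH", channel, message))
        reply = sock.recv(64)
    # PUBLISH answers with a RESP integer, e.g. b":3\r\n".
    if not reply.startswith(b":"):
        raise RuntimeError("unexpected reply: %r" % reply)
    return int(reply[1:].split(b"\r\n")[0].decode("ascii"))


if __name__ == "__main__":
    # Hypothetical channel name; requires a running Redis server.
    print(publish("notifications", "hello"))
```

No HTTP request and no extra server process is involved; the publisher
talks to Redis and the websocket processes that subscribed to the
channel pick the message up.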
Even though I can't be 100% sure, as it currently stands I believe that
if I had implemented the websocket server using Redis' pub / sub to
begin with, the CPU issue would probably never have happened and there
would've been no need to switch over to node.js.
That being said, this type of service / application is better suited
to threads, since the cost in RAM etc. is lower.
Final note: my decision to use one socket per feature was poor. It
allowed me a simpler architecture, but had I opted for one socket with
"routing" logic implemented in the browser instead, I could have
reduced the number of simultaneous sockets by a factor of up to 8. Peak
usage would then have been 2000 / 8 = 250 processes. Not only that, it
turns out that IE (yes, even version 11 / Edge) only allows 6
simultaneous sockets (including in iframes) per page. We've therefore
been forced to turn off, for instance, the tournament functionality for
IE users.
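The single-socket "routing" idea could look roughly like the sketch
below: every message carries the name of the feature it belongs to, and
one dispatcher hands it to the right handler. In the browser this would
be JavaScript; Python is used here only to show the shape of it. The
envelope format ("feature" / "payload") and all names are assumptions
for the example.

```python
import json
from typing import Any, Callable, Dict


class FeatureRouter:
    """Dispatch messages arriving on one shared socket to per-feature
    handlers, instead of opening one socket per feature."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Any], None]] = {}

    def register(self, feature: str, handler: Callable[[Any], None]) -> None:
        self._handlers[feature] = handler

    def dispatch(self, raw: str) -> bool:
        """Route one JSON message; return False if no handler is known."""
        envelope = json.loads(raw)
        handler = self._handlers.get(envelope.get("feature"))
        if handler is None:
            return False
        handler(envelope.get("payload"))
        return True


if __name__ == "__main__":
    router = FeatureRouter()
    router.register("notifications", lambda p: print("notify:", p))
    router.register("tournament", lambda p: print("tournament:", p))
    # Both features share the same connection.
    router.dispatch('{"feature": "tournament", "payload": {"round": 2}}')
    router.dispatch('{"feature": "notifications", "payload": "hi"}')
```

With this layout each page holds one socket regardless of how many
features it uses, which also stays well under the IE limit.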
On Fri, Jun 26, 2015 at 9:30 PM, Henrik Sarvell <hsarv...@gmail.com> wrote:
> Hi all, after over a month without any of the prior issues I now
> consider the websockets part of pl-web stable:
> https://bitbucket.org/hsarvell/pl-web Gone are the days of 100% CPU
> usage and zombie processes.
> With Alex's help the main web server is now more stable (he made me
> throw away a few throws in favour of a few byes). The throws were
> causing the zombies.
> I was also including dbg.l (it was causing hung processes at 100%
> CPU), it's basically been deprecated or something, I'll leave it up to
> him to elaborate. It's just something I've been including by habit
> since years ago when at some point I needed to include it to do some
> kind of debugging.
> Anyway atm the WS router is regularly routing up to 40 messages per
> second to upwards of 300-500 clients, which means that roughly 20,000
> messages are being pushed out per second during peak hours.
> The PL processes show up with 0 CPU and 0 RAM usage when I run top,
> sometimes 1% CPU :) They hardly register even in aggregate, the server
> would be running 99% idle if it was only running the WS server.
> To work around the inter-process limit of 4096 byte long messages the
> router now supports storing the messages in Redis (raw disk is also
> supported if Redis is not available). This is also in effect in
> production and has been working flawlessly for months.
> This is how I start the WS server in production:
> (load "pl-web/pl-web.l")
> (setq *Mobj (new '(+Redis) "pl-ws-"))
> (undef 'app)
> (setq *WsAuth '(("notifications" (("send" ("put your password/key here"))))))
> (de app ()
> (de go ()
> (server 9090) )
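For reference, the workaround for the 4096 byte inter-process limit
mentioned in the quoted message can be pictured as follows: small
messages travel inline, large ones are parked in a store and only a
short key crosses the process boundary. This is a Python sketch of the
general idea, with a plain dict standing in for Redis (or disk). The
one-byte tag, the key prefix and the function names are assumptions,
not the actual pl-web mechanism.

```python
import uuid
from typing import Dict

MAX_FRAME = 4096  # inter-process message limit from the text


def pack(message: bytes, store: Dict[str, bytes]) -> bytes:
    """Return a frame of at most MAX_FRAME bytes.

    b"I" + message  -> payload sent inline
    b"K" + key      -> payload parked in `store` under `key`
    """
    if len(message) < MAX_FRAME:  # leaves room for the 1-byte tag
        return b"I" + message
    key = "msg:" + uuid.uuid4().hex  # hypothetical key format
    store[key] = message
    return b"K" + key.encode("ascii")


def unpack(frame: bytes, store: Dict[str, bytes]) -> bytes:
    """Recover the original message, fetching it from `store` if needed."""
    tag, body = frame[:1], frame[1:]
    if tag == b"I":
        return body
    if tag == b"K":
        return store.pop(body.decode("ascii"))  # fetch and clean up
    raise ValueError("unknown frame tag: %r" % tag)


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    big = b"x" * 10000
    frame = pack(big, store)
    print(len(frame), unpack(frame, store) == big)
```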