[Architecture] Fixing OOM issue in Message Broker

Malinga Purnasiri Sun, 26 May 2013 09:28:58 -0700

Hi,

While working on the OOM issue in the MB, I found some design limitations
that lead to it. Let me summarize my findings in point form.


 ** Static executor pool (org.wso2.andes.pool.AndesExecuter.java) issue*
Unfortunately, MB has a single static executor pool to which we submit all
the runnables that need to execute in parallel. When we send messages in a
burst, the pool threads are soon exhausted and the runnables pile up in the
internal queue backing the executor pool, which eventually leads to OOM.
This was observed from the heap dump analysis with MAT.

      *Solution* : We need to group the runnables by the nature of the task
and submit each group to a different executor pool.
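
A minimal sketch of this idea, using only java.util.concurrent. The class
name TaskPools, the TaskType grouping and the pool sizes are hypothetical and
are not taken from the MB code base. Each group gets its own pool with a
bounded queue, and CallerRunsPolicy makes the submitting thread run the task
itself when the queue is full, so a burst slows the producer down instead of
growing the queue without limit.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical replacement for one static pool: one bounded pool per task type. */
public class TaskPools {
    /** Hypothetical grouping; the real grouping depends on the MB task types. */
    public enum TaskType { PUBLISH, DELIVERY, HOUSEKEEPING }

    private final Map<TaskType, ThreadPoolExecutor> pools = new EnumMap<>(TaskType.class);

    public TaskPools(int threadsPerPool, int queueCapacity) {
        for (TaskType type : TaskType.values()) {
            pools.put(type, new ThreadPoolExecutor(
                    threadsPerPool, threadsPerPool,
                    0L, TimeUnit.MILLISECONDS,
                    // Bounded queue: it can never hold more than queueCapacity runnables.
                    new ArrayBlockingQueue<Runnable>(queueCapacity),
                    // When the queue is full, the caller runs the task (back-pressure).
                    new ThreadPoolExecutor.CallerRunsPolicy()));
        }
    }

    public void submit(TaskType type, Runnable task) {
        pools.get(type).execute(task);
    }

    public int queued(TaskType type) {
        return pools.get(type).getQueue().size();
    }

    /** Stops accepting tasks and waits for the already submitted ones to finish. */
    public void shutdown() {
        for (ThreadPoolExecutor pool : pools.values()) {
            pool.shutdown();
        }
        try {
            for (ThreadPoolExecutor pool : pools.values()) {
                pool.awaitTermination(30, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape, a burst of PUBLISH tasks cannot starve or flood the pool
used for DELIVERY tasks, and the memory held by queued runnables is capped
per pool.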

 ** The way we insert data into Cassandra*
Currently, most of the time we create mutators and execute them on the fly.
I have done some benchmarks that simulate this situation. Following is a
small code block for that ...

Sample ..
    for (int i = 0; i < 1000000; i++) {
        // A new mutator is created and executed for every single insertion
        Mutator<String> messageMutator =
                HFactory.createMutator(keyspaceOperator, stringSerializer);
        messageMutator.addInsertion(..);
        messageMutator.execute();
    }

When we loop and execute on the fly like this, after a few thousand
insertions the latency grows and each insertion takes much longer than we
expected.

        According to the Cassandra documentation, this has another side
effect: sending so many individual messages over the network will exhaust
network bandwidth too.

*Solution : Operate in batch mode, e.g. BatchMutation.*
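
As a rough illustration of batch mode, here is a library-independent sketch.
BatchingWriter and its flushAction callback are hypothetical names; the
callback stands in for one batched write to Cassandra. With Hector, as far as
I understand the Mutator API, that would mean calling addInsertion several
times on a single Mutator and then calling execute() once for the whole
batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects items and writes them out in fixed-size batches. */
public class BatchingWriter<T> {
    private final int batchSize;
    // Stand-in for a single batched write (an assumption, not the MB code).
    private final Consumer<List<T>> flushAction;
    private final List<T> pending = new ArrayList<>();

    public BatchingWriter(int batchSize, Consumer<List<T>> flushAction) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    /** Buffers one item and flushes automatically when the batch is full. */
    public void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered as one batch; does nothing if the buffer is empty. */
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Hand over a copy so the callback can keep the list after we clear ours.
        flushAction.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

For example, adding 250 messages with a batch size of 100 and then calling
flush() results in three writes of 100, 100 and 50 messages instead of 250
separate round trips.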

  ** Message accumulation in LinkedBlockingQueue (observed from heap dump
with MAT)*
Inside the CassendraMessageStore we use a LinkedBlockingQueue to store
messages temporarily until we insert them into Cassandra. But there is a
huge bottleneck here: the producer inserts into the queue very fast, while
the consumer end is very slow. So the blocking queue keeps growing and
creates OOM.

*Solution : Add a BatchMutation mode at the consumer end to make the
consumer faster, so the queue will hold fewer messages at any given time.*
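
A possible shape for such a consumer, sketched with plain java.util.concurrent.
BatchDrainingConsumer and the batchWriter callback are hypothetical names, and
the callback stands in for the batched Cassandra write. The sketch also gives
the LinkedBlockingQueue a capacity, which is my assumption and not part of the
proposal above: with a bound, a fast producer blocks in put() instead of
filling the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical consumer that drains a bounded queue in batches. */
public class BatchDrainingConsumer<T> implements Runnable {
    private final BlockingQueue<T> queue;
    private final int maxBatch;
    // Stand-in for one batched write to the store (an assumption).
    private final Consumer<List<T>> batchWriter;
    private volatile boolean running = true;

    public BatchDrainingConsumer(int capacity, int maxBatch, Consumer<List<T>> batchWriter) {
        this.queue = new LinkedBlockingQueue<>(capacity); // bounded on purpose
        this.maxBatch = maxBatch;
        this.batchWriter = batchWriter;
    }

    /** Blocks while the queue is full; returns false if interrupted while waiting. */
    public boolean publish(T message) {
        try {
            queue.put(message);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Asks the consumer to exit once the queue has been emptied. */
    public void stop() {
        running = false;
    }

    @Override
    public void run() {
        try {
            while (running || !queue.isEmpty()) {
                // Wait briefly for the first message, then grab whatever else is ready.
                T first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                List<T> batch = new ArrayList<>(maxBatch);
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1);
                batchWriter.accept(batch);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Under light load the batches are small and latency stays low; under a burst
the batches grow up to maxBatch, so the consumer catches up with fewer writes.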

  ** PublishMessageWriter run method executes serially*
Inside the thread, we just take messages one by one and try to insert them
into Cassandra, but when we have a burst of messages this creates a
bottleneck. So we must introduce more parallelism here.
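
One way to add that parallelism, sketched under assumptions: several writer
threads take messages from one shared bounded queue. ParallelWriters and the
writer callback are hypothetical names, and the callback stands in for the
Cassandra insertion. Note that with more than one writer the messages are no
longer written in strict arrival order, so this only fits if the ordering
guarantees of the broker are handled elsewhere.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: N writer threads sharing one bounded queue. */
public class ParallelWriters<T> {
    private final BlockingQueue<T> queue;
    private final ExecutorService workers;
    private volatile boolean running = true;

    public ParallelWriters(int writerCount, int capacity, Consumer<T> writer) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.workers = Executors.newFixedThreadPool(writerCount);
        for (int i = 0; i < writerCount; i++) {
            workers.execute(() -> writeLoop(writer));
        }
    }

    /** Each worker runs this loop until stop is requested and the queue is empty. */
    private void writeLoop(Consumer<T> writer) {
        try {
            while (running || !queue.isEmpty()) {
                T message = queue.poll(50, TimeUnit.MILLISECONDS);
                if (message != null) {
                    writer.accept(message); // stand-in for the actual insertion
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Blocks while the queue is full; returns false if interrupted while waiting. */
    public boolean publish(T message) {
        try {
            queue.put(message);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Lets the workers finish the queued messages, then waits for them to exit. */
    public void stopAndWait() {
        running = false;
        workers.shutdown();
        try {
            workers.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```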


---------

Any ideas on this?

Note : I have made the above code changes, and now MB can take messages in
a burst and work without OOM. But we need to design this properly and
implement it at production quality.

-- 
Malinga Pathmal,
Technical Lead, WSO2, Inc. : http://wso2.com/
Phone : (+94) 715335898

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
