[DISCUSS] Real-time processing engine: Storm, Spark, Flink or Cloud Native

Ali Nazemian Thu, 04 Apr 2019 00:36:26 -0700

Hi All,

As far as I understand, there is a plan to change the real-time engine of
Metron because of issues that users and developers have been facing with it.
I would like to describe some critical issues that customers have been
facing, to help the development team decide on the best approach for the
future of Metron. In our experience with Metron, two issues cause most of
the problems, from both the technology and the business perspective:


- Infrastructure cost
- Operational complexity

We have had a lot of trouble minimizing infrastructure cost. We have also
spent significant time tuning the infrastructure to reduce the cost.
However, regardless of what we did, we were not able to manage our cost
properly. The main reason is that the rate of log ingestion fluctuates
heavily: we were receiving 4k eps on a sensor during peak time and less
than 1 eps off-peak (e.g. during the night). What you want in that
situation is an environment that can easily *scale up* and *scale down*
based on your ingestion traffic. Not to mention that there have been
situations where we could not even predict the ingestion rate, because a
cyber attack caused the source devices to generate lots of logs. A DDoS
attack is one such scenario.

When it comes to operational complexity, we have had a lot of trouble
managing sensors and tuning parameters based on the traffic we receive. We
have also had many failures for various reasons, and we spent a fair amount
of time writing scripts that simulate a *self-healing* feature at a very
basic level. In production, we need to be able to respond to different
situations very quickly. For example, if a service is down, bring it up
automatically, or if a new sensor is onboarded, make sure it poses no risk
to other services. We also had lots of discussions about how to create
processes or automated tests to make sure nothing can go wrong. However,
this led us to create lots of platforms to test things from different
aspects, which increased our cost even more. We did not have the capability
to provision a short-lived environment once a PR is submitted. We really
miss the ability to *provision an environment very quickly*. We also really
needed the capability to isolate different sensors and different use cases
entirely, not only in the parser topology but also in the enrichment and
indexing topologies. We needed a good mechanism for *change isolation*.

I understand that the requirements of running an application in the cloud
differ from those of running it on-premises. However, most of them are much
the same when it comes to running Metron in production.

We have recently delivered a data processing pipeline project using a more
cloud-native architecture, and we found how similar the concerns were and
how easily Kubernetes helped us manage these problems by providing native
support for scaling up and down, self-healing, provisioning a short-lived
environment very quickly, and isolating our changes via canary and
blue-green deployments. Of course, following the 12-factor principles was
very important for us in managing those concerns. We used Spring Cloud
Stream, along with some other complementary frameworks provided by Pivotal,
to create an event-driven data processing pipeline for this. What has come
to my mind is: if other customers' experiences of using Metron in
production were similar to ours and they had the same sort of concerns,
could migrating from Storm to an event-driven pipeline help all users have
a better experience running Metron in production? Of course, I am not
across other users' challenges, so I cannot answer that, but it is just an
idea.

There is no doubt that we could have all these features by using Spark as
well in the future, but it requires more time to build the integration, and
some of this functionality is not going to be available very soon. It is
just a thought that the Metron architecture is already event-driven at some
stages and stateless by nature, which makes it a good fit for using an
event-driven pipeline to deploy it on containers.

Cheers,
Ali
