Hi All,

As far as I understand, there is a plan to change the real-time engine of Metron because of some issues that users and developers have been facing with it. I would like to explain some critical issues that customers have been facing, to help the development team decide on the best approach for the future of Metron.

Based on our experience with Metron, there are two important issues that cause a lot of problems from both the technology and the business side:
- Infrastructure cost
- Operational complexity

We have had a lot of trouble minimizing infrastructure cost, and we have spent significant time tuning the infrastructure to reduce it. However, regardless of what we did, we were not able to manage our cost properly. The main reason is that the log ingestion rate fluctuates heavily: we were receiving 4k eps on a sensor during peak time and less than 1 eps off-peak (e.g. at night). In that situation you want an environment that can easily *scale up* and *scale down* based on your ingestion traffic. On top of that, there have been situations where we could not even predict the ingestion rate, because some kind of cyber attack caused the source devices to generate lots of logs. A DDoS attack is one example of such a scenario.

When it comes to operational complexity, we have had a lot of trouble managing sensors and tuning different parameters based on the traffic we receive. We have also had many failures for different reasons, and we spent a fair amount of time writing scripts that simulate a *self-healing* feature at a very basic level. In a production use case, we need to be able to respond to different situations very quickly. For example, if a service is down, bring it up automatically, or if a new sensor is onboarded, make sure that it poses no risk to other services.

We have also had lots of discussions about how to create processes or automated tests to make sure nothing can go wrong. However, this made us create lots of platforms to test things from different aspects, which increased our cost even more. We did not have the capability to provision a short-lived environment once a PR is submitted. We really miss the ability to *provision an environment very quickly*.
We also really needed the capability to isolate different sensors and different use cases entirely, not only in the parser topology but also in the enrichment and indexing topologies. We needed a good mechanism for *change isolation*.

I understand that the requirements of running an application in the cloud differ from running it on-premise. However, the majority of them are much the same when it comes to running Metron in production. We recently delivered a data processing pipeline project using a more cloud-native architecture, and we found how similar the concerns were and how easily Kubernetes helped us manage them: it provides native support for scaling up and down and for self-healing, and it let us provision a short-lived environment very quickly and isolate our changes via canary and blue-green deployments. Of course, following the 12-factor principles was also very important for us in managing those concerns. For this project we used Spring Cloud Stream, together with some other complementary frameworks provided by Pivotal, to create an event-driven data processing pipeline.

What has come to my mind is this: if other customers' experiences of running Metron in production were similar to ours and they had the same sort of concerns, could migrating from Storm to an event-driven pipeline give all users a better experience with running Metron in production? Of course, I am not across other users' challenges, so I cannot answer that; it is just an idea. There is no doubt that we could have all these features by using Spark as well in the future, but that requires more time to build the integration, and some of these functionalities are not going to be available very soon.

It is just a thought, but the Metron architecture is already event-driven at some stages and stateless by nature, which makes it a good fit for an event-driven pipeline deployed on containers.

Cheers,
Ali