Hi Ali,

Thank you for taking the time to share your experiences with us. I've been
thinking about this for a while now and wanted to take some time to reflect
before responding. I need to kick off a proper dev list DISCUSS thread on
this, but if you've seen a couple of the recent refactoring PRs, you are
right that we've been looking to decouple ourselves from Storm and open up
the possibility of onboarding another processing engine. The most obvious
candidate here, imho, is Apache Spark. Getting right to the meat of your
discussion points, I don't think this is an either/or proposition between
Kubernetes and Spark. I see this as an AND proposition. The reality is that
Spark offers quite a bit from a job scaling, redundancy, and efficiency
perspective, not to mention the capabilities it provides purely as a data
transformation and processing engine. The real roadmap, at
least in my mind, would be for us to onboard Spark and then leverage
Kubernetes at some point to enable some of the features that you describe -
vertical and horizontal elasticity, in particular. In addition to that,
Helm could provide some compelling features for managing that container
application deployment story. Expect a discussion from me very soon about
more specific ideas as to what I think our integration with Spark can and
should look like in the near future with Metron. We have nearly completed
decoupling our core infrastructure from Storm at this point, which opens us
up to a number of possibilities going forward.
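
To make the elasticity point concrete, here is a minimal sketch of what
horizontal elasticity could look like once we're on Kubernetes. This is purely
hypothetical — the "metron-parser" deployment name and the thresholds are made
up for illustration, not anything we ship today:

```yaml
# Scale a hypothetical containerized parser deployment between 1 and 10
# replicas, tracking average CPU utilization across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: metron-parser            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metron-parser
  minReplicas: 1                 # shrink off-peak
  maxReplicas: 10                # grow with peak ingestion
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Helm charts would then give us a way to version and template manifests like
this one as part of that deployment story.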

Best,
Mike Miklavcic


On Thu, Apr 4, 2019 at 1:35 AM Ali Nazemian <alinazem...@gmail.com> wrote:

> Hi All,
>
> As far as I understand, there is a plan to change the real-time engine of
> Metron due to some issues that users and developers have been facing with
> it. I would like to explain some critical issues that customers have been
> facing, to help clarify for the development team what the best approach
> could be for the future of Metron. Based on the experience we have had with
> Metron, there are two important issues that cause lots of problems from
> both a technology and a business perspective:
>
> - Infrastructure cost
> - Operational complexity
>
> We have had lots of issues minimizing infrastructure cost. We have spent
> significant time tuning the infrastructure to reduce it. However,
> regardless of what we did, we were not able to manage our cost properly.
> The main reason is that the rate of log ingestion fluctuates heavily: we
> were receiving 4k eps on a sensor during peak time and less than 1 eps
> off-peak (e.g. during the night). The problem is that you want an
> environment that can easily *scale up* and *scale down* based on your
> ingestion traffic. Not to mention that there have been situations where we
> could not even predict the ingestion rate, such as during a cyber attack
> where lots of logs are generated by the source devices. A DDoS, for
> example, is one scenario that generates a huge volume of logs.
>
> When it comes to operational complexity, we have had lots of issues
> managing sensors and tuning different parameters based on the traffic we
> receive. We have had lots of failures as well, for different reasons, and
> we spent a fair amount of time writing scripts that simulate a
> *self-healing* feature at a very basic level. In production, we need to be
> able to respond to different situations very quickly. For example, if a
> service is down, bring it up automatically; if a new sensor is onboarded,
> make sure there won't be any risk to other services. We also had lots of
> discussions about how to create different processes or automated tests to
> make sure nothing can go wrong. However, this forced us to build lots of
> platforms to test things from different angles, which increased our cost
> even more. We didn't have the capability of provisioning a short-lived
> environment once a PR is submitted; we really missed the ability to
> *provision an environment very quickly*. We also needed the capability to
> isolate different sensors and different use cases entirely, not only at
> the parser topology but also at the enrichment and indexing topologies. We
> needed a good mechanism for *change isolation*.
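>
> In our cloud-native project, most of this *self-healing* came from
> Kubernetes liveness probes rather than custom scripts. A minimal,
> hypothetical sketch (the endpoint and port are made up for illustration):
>
> ```yaml
> # Container spec fragment: Kubernetes restarts the container
> # automatically when its health endpoint stops responding.
> livenessProbe:
>   httpGet:
>     path: /healthz          # hypothetical health endpoint
>     port: 8080
>   initialDelaySeconds: 30   # give the service time to start
>   periodSeconds: 10
>   failureThreshold: 3       # restart after ~30s of failures
> ```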
>
> I understand that the requirements of running an application in the cloud
> are different from on-premise. However, the majority of them are much the
> same when it comes to running Metron in production.
>
> We recently delivered a data processing pipeline project using a more
> cloud-native architecture, and we found out how similar the concerns were
> and how easily Kubernetes helped us manage these problems by providing
> native support for scaling up and down, self-healing, provisioning a
> short-lived environment very quickly, and isolating our changes via canary
> and blue-green deployments. Of course, following the 12-factor principles
> was important for managing those concerns. We used Spring Cloud Stream,
> along with some complementary frameworks provided by Pivotal, to create an
> event-driven data processing pipeline for this. What has come to my mind
> is: if other customers' experiences of using Metron in production are
> similar to ours, and they have had the same sort of concerns, could
> migrating from Storm to an event-driven pipeline help all users have a
> better experience running Metron in production? Of course, I have not been
> across other users' challenges, so I cannot answer that; it is just an
> idea.
>
> There is no doubt that we could have all these features by using Spark as
> well in the future, but that requires more time to build the integration,
> and some of these capabilities are not going to be available very soon. It
> is just a thought that the Metron architecture is already event-driven at
> some stages and stateless by nature, which makes it a good fit for an
> event-driven pipeline deployed on containers.
>
> Cheers,
> Ali
>
