[
https://issues.apache.org/jira/browse/NIFI-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793687#comment-17793687
]
Pierre Villard commented on NIFI-12236:
---------------------------------------
I'd like to add my 2 cents here:
This feature is already used by some NiFi users and has proved extremely
useful for debugging problematic situations. It's really not about providing a
long-term monitoring solution, but about being able to troubleshoot when
something bad has happened. In particular, when something bad happens and a
node is restarted, all of the data is lost with the current default
implementation, which makes debugging/troubleshooting really complicated.
Regarding Datadog/Prometheus, as someone passionate about NiFi's monitoring, I
definitely agree that anyone running NiFi in production should deploy such
solutions to monitor the NiFi service. However, what we're talking about here
is very different (in my opinion). We all know that NiFi users most of the
time run NiFi in a very "multi-tenant" fashion (i.e. many different use cases
/ process groups running in the same NiFi environment). While the Prometheus
endpoint and the reporting tasks are great for reporting high-level monitoring
metrics to tools with very advanced dashboarding capabilities, "per use case"
monitoring is a completely different story. Even if you're sending everything
into something like Prometheus, building the monitoring dashboards per use
case will be quite some work. And I think the same would be true if we said:
please provide your own database, we'll push data there, and then it's up to
you to build dashboards on top of that.
As far as I'm concerned, and just to give a single example, when looking at
performance optimizations in a flow I may want to see the graph for a specific
processor showing metrics such as average task duration and average lineage
duration. This level of detail would be quite hard to get if not offered by
NiFi itself. Having a capability that persists all of this data across
restarts is really useful.
While I can definitely accept the concerns around making it the default for
NiFi 2.0, I think the corresponding PR represents a tremendous amount of work
that would be valuable to NiFi users.
> Improving fault tolerance of the QuestDB backed metrics repository
> ------------------------------------------------------------------
>
> Key: NIFI-12236
> URL: https://issues.apache.org/jira/browse/NIFI-12236
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Simon Bence
> Assignee: Simon Bence
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Based on the related discussion on the dev mailing list, the QuestDB-backed
> metrics repository needs better fault tolerance so that it can serve as a
> viable option for the default metrics data store. The work should primarily
> focus on handling unexpected database events such as a corrupted database or
> running out of disk space. Any issue should first be handled with an attempt
> to keep the database service healthy; if that is impossible, the priority is
> to keep NiFi and its core services running, even at the price of an outage
> in metrics collection / presentation.
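The degradation policy described in the issue (try to keep the store healthy; if that fails, sacrifice metrics collection rather than NiFi itself) could be sketched roughly as below. This is an illustrative sketch only: the interface and class names are hypothetical, not the actual NiFi or QuestDB API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the fault-tolerance policy: a wrapper that swallows
// storage failures so the caller (the framework) never goes down because the
// metrics store did. Names are illustrative, not NiFi's real classes.
interface MetricsStore {
    void record(String componentId, long taskDurationNanos) throws Exception;
}

final class FaultTolerantMetricsStore {
    private final MetricsStore delegate;
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    FaultTolerantMetricsStore(MetricsStore delegate) {
        this.delegate = delegate;
    }

    /** Records a metric; on any storage failure (e.g. corrupted database,
     *  disk full) it disables further writes -- a metrics outage -- instead
     *  of propagating the error to the caller. */
    void record(String componentId, long taskDurationNanos) {
        if (!healthy.get()) {
            return; // store already failed: drop metrics, keep the app running
        }
        try {
            delegate.record(componentId, taskDurationNanos);
        } catch (Exception e) {
            healthy.set(false); // degrade gracefully rather than crash
        }
    }

    boolean isHealthy() {
        return healthy.get();
    }
}
```

In this sketch a single failure permanently disables the store for simplicity; a real implementation might instead retry, rebuild the database, or re-enable the store once disk space is recovered.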
--
This message was sent by Atlassian Jira
(v8.20.10#820010)