[
https://issues.apache.org/jira/browse/NIFI-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793687#comment-17793687
]
Pierre Villard commented on NIFI-12236:
---------------------------------------
I'd like to add my 2 cents here:
This feature is already used by some NiFi users and has proved extremely
useful for debugging problematic situations. It's really not about providing a
long-term monitoring solution, but about being able to troubleshoot when
something bad has happened. In particular, when something bad happens and a
node is restarted, all of the data is lost with the current default
implementation, which makes debugging/troubleshooting really complicated.
Regarding Datadog/Prometheus, as someone passionate about NiFi's monitoring, I
definitely agree that anyone running NiFi in production should deploy such
solutions to monitor the NiFi service. However, what we're talking about here
is very different (in my opinion). We all know that NiFi users most of the
time run NiFi in a very "multi-tenant" fashion (i.e. many different use cases
/ process groups running in the same NiFi environment). While the Prometheus
endpoint and the reporting tasks are great for reporting high-level monitoring
metrics to tools with very advanced dashboarding capabilities, "per use case"
monitoring is a completely different story. Even if you're sending everything
into something like Prometheus, building the monitoring dashboards per use
case will be quite some work. And I think the same would be true if we said:
please provide your own database, we'll push data there, and then it's up to
you to build dashboards on top of that.
As far as I'm concerned, and just to give a single example, when looking at
performance optimizations in a flow I may want to see the graph for a specific
processor showing metrics such as average task duration and average lineage
duration. This level of detail would be quite hard to get if not offered by
NiFi itself. Having a capability that persists all of this data across
restarts is really useful.
While I can definitely accept the concerns around making it the default for
NiFi 2.0, I think the corresponding PR represents a tremendous amount of work
that would be valuable to NiFi users.
> Improving fault tolerance of the QuestDB backed metrics repository
> ------------------------------------------------------------------
>
> Key: NIFI-12236
> URL: https://issues.apache.org/jira/browse/NIFI-12236
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Simon Bence
> Assignee: Simon Bence
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Based on the related discussion on the dev mailing list, the QuestDB-backed
> metrics repository needs better fault tolerance so that it can serve as a
> viable option for the default metrics data store. The work should primarily
> focus on handling unexpected database events such as a corrupted database or
> running out of disk space. Any issue should first be handled with an attempt
> to keep the database service healthy; if that is impossible, the priority is
> to keep NiFi and its core services running, even at the price of an outage
> in metrics collection / presentation.
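The degradation policy described in the issue (try to keep the store healthy; if that fails, sacrifice metrics collection rather than NiFi itself) could be sketched roughly as below. This is an illustrative sketch only: the interface and class names are hypothetical, not the actual NiFi or QuestDB API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the fault-tolerance policy: a wrapper that swallows
// storage failures so the caller (the framework) never goes down because the
// metrics store did. Names are illustrative, not NiFi's real classes.
interface MetricsStore {
    void record(String componentId, long taskDurationNanos) throws Exception;
}

final class FaultTolerantMetricsStore {
    private final MetricsStore delegate;
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    FaultTolerantMetricsStore(MetricsStore delegate) {
        this.delegate = delegate;
    }

    /** Records a metric; on any storage failure (e.g. corrupted database,
     *  disk full) it disables further writes -- a metrics outage -- instead
     *  of propagating the error to the caller. */
    void record(String componentId, long taskDurationNanos) {
        if (!healthy.get()) {
            return; // store already failed: drop metrics, keep the app running
        }
        try {
            delegate.record(componentId, taskDurationNanos);
        } catch (Exception e) {
            healthy.set(false); // degrade gracefully rather than crash
        }
    }

    boolean isHealthy() {
        return healthy.get();
    }
}
```

In this sketch a single failure permanently disables the store for simplicity; a real implementation might instead retry, rebuild the database, or re-enable the store once disk space is recovered.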
--
This message was sent by Atlassian Jira
(v8.20.10#820010)