[jira] [Commented] (NIFI-12236) Improving fault tolerancy of the QuestDB backed metrics repository

Joe Witt (Jira) Wed, 06 Dec 2023 07:33:20 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793786#comment-17793786
 ]


Joe Witt commented on NIFI-12236:
---------------------------------

I want to keep the different threads of concern clear.

The idea:
* I think we all agree there is good value in having these data points 
persisted across restarts.  The example Pierre gives is a perfect example of 
why.
* The reality though is our user interface is not designed for data to be held 
for long.  It isn't clear to me from the properties mentioned in the PR: what 
is the plan for helping the user configure how long this data is 
retained/queryable?  I see batch size and frequency but am not quite sure what 
those would mean relative to retention.  Perhaps that is part of the earlier 
implementation.

The implementation options:
* Offer an embedded state storage mechanism.  QuestDB is one example.  This 
could be a database per NiFi instance or a database per cluster.  This 
embedded/batteries-included mode is quite convenient for the user, but we then 
of course have to do it quite thoroughly and consider upgrade scenarios.  I 
think our pucker factor here is far higher given the challenges we've had to 
work through with H2 over the last year or two.
* Offer the ability to connect to/use a database of the user's choosing.  Defer 
installation/durability/security of that database to the user as part of their 
normal database operations, etc.  This I find is more in line with deployment 
styles we see in the Cloud, or automated with Ansible, or how one would deploy 
in K8S.

This implementation:
* It is a QuestDB per NiFi node.  It does not address the other mode should a 
user want that.  Of note, when we offered the embedded ZooKeeper mode we also 
offered the ability to connect to a real ZooKeeper install.  That was to 
reflect the likely non-prod vs prod usage, and I think that holds here as well.
* It has a change to the nifi-api.  We should strongly avoid any such API 
changes as part of this activity unless that change has value to various other 
components and the purpose/meaning of that change is very clear.
* It includes a retry mechanism that the PR suggests might be addressed in a 
later commit.  I don't quite follow what that really means, but I recommend 
making this implementation as simple/straightforward as possible.

The default selection:
* We should not be changing the default until this model is proven to be stable 
and recoverable in the same manner as the current in-memory implementation.



> Improving fault tolerancy of the QuestDB backed metrics repository
> ------------------------------------------------------------------
>
>                 Key: NIFI-12236
>                 URL: https://issues.apache.org/jira/browse/NIFI-12236
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Simon Bence
>            Assignee: Simon Bence
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Based on the related discussion on the dev email list, the QuestDB handling 
> of the metrics repository needs to be improved to have better fault tolerance 
> so that it can be used as a viable option for the default metrics data 
> store. This should primarily focus on handling unexpected database events like 
> a corrupted database or loss of space on the disk. Any issues should be handled 
> with an attempt to keep the database service healthy, but if that is 
> impossible, the priority is to keep NiFi and the core services running, even 
> at the price of a metrics collection / presentation outage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
