[ https://issues.apache.org/jira/browse/NIFI-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099055#comment-16099055 ]

Michael Moser commented on NIFI-3376:
-------------------------------------

I don't think the degree of difficulty should stop us from making a good-faith 
effort to resolve issues like this.  When you have a 99% full content_repo, 
it's not a good user experience to be left wondering what NiFi could possibly 
be keeping in there.  I agree that compaction is going to be a hard problem to 
solve, but if we can't work out an alternative that gets us most of the way 
there, then we should seriously consider moving forward with it anyway.

So, in the interest of an 80% solution, what does everyone think of this 
proposal?  As I currently understand it, if we've placed 99 files (for 
example) into a single resource claim that has not yet reached 
MAX_APPENDABLE_CLAIM_LENGTH, and the next file we want to store is very large, 
then that large file could become the 100th file in that claim.  The deletion 
of that large file from the content_repo is now dependent on the disposition 
of the 99 smaller files.  Can we always just write files that are larger than 
MAX_APPENDABLE_CLAIM_LENGTH into their own resource claim?  Now that 
MAX_APPENDABLE_CLAIM_LENGTH is configurable again, NiFi users would have 
fine-grained control over this behavior, and it's easy to explain.

> Implement content repository ResourceClaim compaction
> -----------------------------------------------------
>
>                 Key: NIFI-3376
>                 URL: https://issues.apache.org/jira/browse/NIFI-3376
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Michael Moser
>            Assignee: Michael Hogue
>
> On NiFi systems that deal with many files whose size is less than 1 MB, we 
> often see that the actual disk usage of the content_repository is much 
> greater than the size of flowfiles that NiFi reports are in its queues.  As 
> an example, NiFi may report "50,000 / 12.5 GB" but the content_repository 
> takes up 240 GB of its file system.  This leads to scenarios where a 500 GB 
> content_repository file system gets 100% full, but "I only had 40 GB of data 
> in my NiFi!"
>
> When several content claims exist in a single resource claim, and most but 
> not all of those content claims are terminated, the entire resource claim is 
> still not eligible for deletion or archive.  This could mean that only one 
> 10 KB content claim out of a 1 MB resource claim is counted by NiFi as 
> existing in its queues.
>
> If a particular flow has a slow egress point where flowfiles can back up and 
> remain on the system longer than expected, this problem is exacerbated.
>
> A potential solution is to compact resource claim files on disk.  A 
> background thread could examine all resource claims and rewrite the files of 
> those that have grown "old" and whose active content claim usage has dropped 
> below a threshold (a rough sketch follows after this description).
>
> A potential work-around is to allow modification of the FileSystemRepository 
> MAX_APPENDABLE_CLAIM_LENGTH to make it a smaller number.  This would 
> increase the probability that the content claim reference count in a 
> resource claim reaches 0 and the resource claim becomes eligible for 
> deletion/archive.  This lets users trade off performance for more accurate 
> accounting of NiFi queue size relative to content repository size.
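
For discussion, here is a rough sketch of the background compaction pass 
described in the quoted description above.  The ClaimStore interface and all 
of its methods are hypothetical placeholders, not real NiFi APIs; the 
genuinely hard part, remapping the offsets that surviving content claims hold 
into the rewritten file, is hidden behind rewriteKeepingLive():

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch only; ClaimStore is a hypothetical placeholder.
    class CompactionSketch {

        interface ClaimStore {
            List<Long> resourceClaimIds();
            long ageMillis(long claimId);          // time since last append
            double liveFraction(long claimId);     // live bytes / total bytes
            void rewriteKeepingLive(long claimId); // rewrite file, remap offsets
        }

        static final long MIN_AGE_MILLIS = 5 * 60 * 1000L;  // what counts as "old"
        static final double LIVE_FRACTION_THRESHOLD = 0.25; // compact below 25% live

        static void start(final ClaimStore store) {
            ScheduledExecutorService executor =
                    Executors.newSingleThreadScheduledExecutor();
            executor.scheduleWithFixedDelay(() -> {
                for (long claimId : store.resourceClaimIds()) {
                    // Compact only claims that are old and mostly dead, so we
                    // never rewrite a claim that is still being appended to.
                    if (store.ageMillis(claimId) >= MIN_AGE_MILLIS
                            && store.liveFraction(claimId) < LIVE_FRACTION_THRESHOLD) {
                        store.rewriteKeepingLive(claimId);
                    }
                }
            }, 1, 1, TimeUnit.MINUTES);
        }
    }

Even as a sketch, this shows why compaction is hard: rewriteKeepingLive() has 
to move live content and atomically update every reference to its old offset, 
which is exactly the step the 80% proposal above avoids.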



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
