I agree the limits sound low and will address that tomorrow. I'm not seeing FileNotFound or NoSuchFile.
Here's an example file:

$ grep 1481763927251 logs/nifi-app.log
2016-12-14 17:05:27,277 DEBUG [Timer-Driven Process Thread-36] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
2016-12-14 17:05:27,357 DEBUG [Timer-Driven Process Thread-2] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 2
2016-12-14 17:05:27,684 DEBUG [Timer-Driven Process Thread-36] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
2016-12-14 17:05:27,732 DEBUG [Timer-Driven Process Thread-2] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 0
2016-12-14 17:05:27,909 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
2016-12-14 17:05:27,945 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 2
2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 1
2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481763927251-1, container=default, section=1] to 0

This nifi-app.log covers a period when the nifi only handled two sets of files, for a total of maybe 10GB uncompressed. The content repository went over 100GB in that time. I checked a few content repository files, and they all had similar patterns: the claimant count hit 0 twice, once around 17:05 and once around 17:14, then nothing. I brought down the nifi around 17:30.

During that time, we did have a processor hitting OutOfMemory while unpacking a 1GB file. I'm adjusting the heap to try to make that succeed in case that was related.

On Wed, Dec 14, 2016 at 8:32 PM, Mark Payne <marka...@hotmail.com> wrote:

> OK, so these are generally the default values for most linux systems. These are a little low, though, for what NiFi recommends and often needs. With these settings, you can easily run out of open file handles. When this happens, trying to access a file will return a FileNotFoundException even though the file exists and permissions all look good. As a result, NiFi may be failing to delete the data simply because it can't get an open file handle.
>
> The admin guide [1] explains the best practices for configuring these settings. Generally, after updating these settings, I think you have to log out of the machine and log in again for the changes to take effect. Would recommend you update these settings and also search the logs for "FileNotFound" as well as "NoSuchFile" and see if that hits anywhere.
>
> [1] http://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
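For concreteness, the settings the admin guide is talking about are the per-user open-file and process limits, typically set in /etc/security/limits.conf (or a file under /etc/security/limits.d/). The values below are only an illustrative sketch; use the numbers the guide actually recommends:

*  hard  nofile  50000
*  soft  nofile  50000
*  hard  nproc   10000
*  soft  nproc   10000

And a quick way to run the suggested log search across all of the rolled-over app logs:

$ grep -riE "FileNotFound|NoSuchFile" logs/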
> On Dec 14, 2016, at 8:25 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> We haven't let the disk hit 100% in a while, but it's been crossing 90%. We haven't seen the "Unable to checkpoint" message in the last 24 hours.
>
> $ ulimit -Hn
> 4096
> $ ulimit -Sn
> 1024
>
> I will work on tracking a specific file next.
>
> On Wed, Dec 14, 2016 at 8:17 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> At first I thought that the drained messages always said 0, but that's not right. What should the total number of claims drained be? The number of flowfiles that made it through the system? If so, I think our number is low:
>
> $ grep "StandardResourceClaimManager Drained" nifi-app_2016-12-14* | grep -v "Drained 0" | awk '{sum += $9} END {print sum}'
> 25296
>
> I'm not sure how to get the count of flowfiles that moved through, but I suspect that's low by an order of magnitude. That instance of nifi has handled 150k files in the last 6 hours, most of which went through a number of processors and transformations.
>
> Should the number of drained claims correspond to the number of flow files that moved through the system?
>
> Alan
>
> On Wed, Dec 14, 2016 at 6:59 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> Some updates:
> * We fixed the issue with missing transfer relationships, and this did not go away.
> * We saw this a few minutes ago when the queue was at 0.
>
> What should I be looking for in the logs to figure out the issue?
>
> Thanks,
> Alan
>
> On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> In case this is interesting, I think this started getting bad when we started hitting an error where some of our files were not given a transfer relationship. Maybe some combination of not giving flow files a relationship and the subsequent penalization is causing the problem.
>
> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> Everything is at the default locations for these nifis.
>
> On one of the two machines, I did find log messages like you suggested:
> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)
>
> I added the logger, which apparently takes effect right away. What am I looking for in these logs?
> I see a lot of stuff like:
> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []
>
> What's puzzling to me is that both of these machines have more than 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.
>
> Thanks,
> Alan
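A follow-up check along these lines (just a sketch: it assumes the default ./content_repository location and the exact log format shown above) is to extract the claim IDs whose claimant count reached 0 and see whether their backing files are still sitting in the content repository:

$ grep "Decrementing claimant count" logs/nifi-app.log | grep "to 0$" \
    | sed 's/.*StandardResourceClaim\[id=\([^,]*\),.*/\1/' | sort -u > zero-claims.txt
$ while read id; do find ./content_repository -name "$id"; done < zero-claims.txt

If files for zeroed-out claims keep turning up long after the checkpoint window described below, that points at the claims never being handed to the cleanup workers rather than at slow deletion.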
> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com> wrote:
>
> Alan,
>
> Thanks for the thread-dump and the in-depth analysis!
>
> So in terms of the two tasks there, here's a quick explanation of what each does:
>
> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).
>
> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.
>
> As you mentioned, in the Executor, if the Runnable fails it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely within a try { ... } catch (Throwable t) {...}, so the method will never end exceptionally. Also, the thread dump shows all of the threads created by that Thread Pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means that they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop or anything like that.
>
> Now, with all of that being said, I have a theory as to what could perhaps be happening :)
>
> From the configuration that you listed below, it shows that the content repository is located at ./content_repository, which is the default. Is the FlowFile Repository also located at the default location of ./flowfile_repository? The reason that I ask is this:
>
> When I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done in order to ensure that we can recover if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; as a result, on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up having data loss. This method of only deleting Resource Claims after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced.
>
> So that was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately. So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen:
>
> At some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository is unable to be compacted. Since this is not happening, it will not mark any of the Resource Claims as destructible. This would mean that the Content Repository does not get cleaned up. So now you've got a full Content Repository and it's unable to clean up after itself, because no Resource Claims are getting marked as destructible.
>
> So to prove or disprove this theory, there are a few things that you can look at:
>
> Do you see the following anywhere in your logs: "Unable to checkpoint FlowFile Repository"?
>
> If you add the following to your conf/logback.xml:
> <logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />
> then that should allow you to see a DEBUG-level log message every time that a Resource Claim is marked destructible and every time that the Content Repository requests the collection of destructible claims ("Drained 100 destructable claims", for instance).
>
> Any of the logs related to those statements should be very valuable in determining what's going on.
>
> Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!
>
> -Mark
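Two quick checks follow directly from this theory (a sketch; the paths assume the default ./flowfile_repository and ./content_repository locations):

$ df -h ./flowfile_repository ./content_repository    # the same "Mounted on" value means the repositories share a volume
$ grep -l "Unable to checkpoint FlowFile Repository" logs/nifi-app*.log

The first shows whether the two repositories can actually starve each other for space; the second lists any rolled-over log files that contain the checkpoint failure described above.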
> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> Here is what I have figured out so far.
>
> The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232
>
> I'm not totally sure which one of those is the one that should be cleaning things up. It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit), if those methods fail once, they will stop running forever. Also, if the thread got stuck, it wouldn't launch a new one.
>
> I then hoped I would go into the logs, see a failure, and use it to figure out the issue.
>
> What I'm seeing instead is things like this, which comes from BinDestructableClaims:
> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis
> These messages are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately there isn't an error message I can find that's killing these.
>
> At nifi startup, I see messages like this, which come from something (not sure what yet) calling the cleanup() method on FileSystemRepository:
> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file
> I never see those after the initial cleanup that happens on restart.
>
> I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, these were always all parked.
>
> Attached one of the thread dumps.
>
> Thanks,
> Alan
>
> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com> wrote:
>
> Alan,
>
> It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine if there is something going on that is preventing the cleanup from happening.
>
> Thanks
> -Mark
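To see whether those worker threads ever leave WAITING while the repository is growing, it can help to capture a handful of dumps spaced a little apart and compare the relevant sections (a sketch using the dump command mentioned above; the thread-name filter comes from the dumps discussed in this thread):

$ for i in 1 2 3 4 5; do ./bin/nifi.sh dump thread-dump-$i.txt; sleep 30; done
$ grep -A 3 "FileSystemRepository Workers" thread-dump-*.txt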
> ________________________________
> From: Alan Jackoway <al...@cloudera.com>
> Sent: Sunday, December 11, 2016 11:11 AM
> To: dev@nifi.apache.org
> Subject: Re: Content Repository Cleanup
>
> This just filled up again even with nifi.content.repository.archive.enabled=false.
>
> On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB.
>
> Is there a property I can set to make it clean things up more frequently? I expected that once I turned archive enabled off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far the only way I have found to reliably get nifi to clear out the content repository is to restart it.
>
> Our version string is the following, if that interests you:
> 11/26/2016 04:39:37 PST
> Tagged nifi-1.1.0-RC2
> From ${buildRevision} on branch ${buildBranch}
>
> Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.
>
> Thanks,
> Alan
>
> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com> wrote:
>
> No, in my scenario a server restart would not affect the content repository size.
>
> On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com> wrote:
>
> If we were in the situation Joe G described, should we expect that when we kill and restart nifi it would clean everything up? That behavior has been consistent every time: when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get within the archive policy again. That sounds less like the files are stuck and more like it failed trying.
>
> For now I just turned off archiving, since we don't really need it for this use case.
>
> I attached a jstack from last night's failure, which looks pretty boring to me.
>
> On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com> wrote:
>
> The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time trouble was happening. We will try your suggestion to see if the situation improves.
>
> Thanks,
> Alan
>
> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com> wrote:
>
> Not sure if your scenario is related, but one of the NiFi devs recently explained to me that the files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.
>
> This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full.
> On a hunch, I tried the following strategy:
> - MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
> - MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
> - PutFile into a directory
> - GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
> - UnpackContent (untar them)
> - UnpackContent (turn them back into flow files with the original attributes)
> - Then do the processing they were originally designed for
>
> This had the effect of very quickly reducing the size of my content repository to very nearly the actual size I saw reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.
>
> Hope this helps.
> Joe
>
> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
> Alan,
>
> That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically would mean that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you then we need to dig deeper. If you could do screenshots or share log files and stack dumps around this time those would all be helpful. If the screenshots and such are too sensitive please just share as much as you can.
>
> Thanks
> Joe
>
> On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> One other note on this: when it came back up, there were tons of messages like this:
> 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file
>
> I haven't dug into what that means.
> Alan
>
> On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> Hello,
>
> We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again.
>
> We are running nifi 1.1 for this. Our nifi.properties looked like this:
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=10 MB
> nifi.content.claim.max.flow.files=100
> nifi.content.repository.directory.default=./content_repository
> nifi.content.repository.archive.max.retention.period=12 hours
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=true
> nifi.content.repository.always.sync=false
>
> I just bumped the retention period down to 2 hours, but should the max usage percentage protect us from using 100% of the disk?
>
> Unfortunately we didn't get jstacks on either failure. If it hits 100% again I will make sure to get that.
>
> Thanks,
> Alan
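The usage-percentage question can be checked on disk directly. A rough sketch, assuming the default FileSystemRepository layout in which archived claims live in per-section archive/ subdirectories under the content repository:

$ du -sh ./content_repository
$ du -chs ./content_repository/*/archive 2>/dev/null | tail -n 1

If the first number is large while the second (the archived portion, which is what the retention settings govern) stays small, the space is being held by claims still considered in use rather than by the archive.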
> --
> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>
> --
> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>
> <thread-dump.txt>