Everything is at the default locations for these nifis. On one of the two machines, I did find log messages like you suggested:

2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)
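As a quick sanity check on how much space the repository volumes actually have, a few lines of plain JDK code can report it directly. This is only a sketch; the relative paths below assume the default ./flowfile_repository and ./content_repository locations mentioned in this thread, so adjust them if your nifi.properties points elsewhere.

    import java.io.File;

    public class RepoSpaceCheck {
        public static void main(String[] args) {
            // Default repository locations, relative to the NiFi working directory.
            String[] repos = {"./flowfile_repository", "./content_repository"};
            for (String path : repos) {
                File dir = new File(path);
                long usableGb = dir.getUsableSpace() / (1024L * 1024 * 1024);
                long totalGb = dir.getTotalSpace() / (1024L * 1024 * 1024);
                System.out.printf("%s -> %d GB usable of %d GB total%n", path, usableGb, totalGb);
            }
        }
    }

If both paths report the same numbers, they live on the same volume, which matters for the theory discussed later in the thread.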
I added the logger, which apparently takes effect right away. What am I looking for in these logs? I see a lot of stuff like:

2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []

What's puzzling to me is that both of these machines have > 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.

Thanks,
Alan

On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com> wrote:

> Alan,
>
> Thanks for the thread-dump and the in-depth analysis!
>
> So in terms of the two tasks there, here's a quick explanation of what each does:
> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).
> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.
>
> As you mentioned, in the Executor, if the Runnable fails it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely within a try { ... } catch (Throwable t) {...}. So the method will never end exceptionally. Also, the thread dump shows all of the threads created by that Thread Pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means that they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop or anything like that.
>
> Now, with all of that being said, I have a theory as to what could perhaps be happening :)
>
> From the configuration that you listed below, it shows that the content repository is located at ./content_repository, which is the default.
> Is the FlowFile Repository also located at the default location of ./flowfile_repository? The reason that I ask is this:
>
> When I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done in order to ensure that we can recover if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; as a result, on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up having data loss. This method of only deleting Resource Claims after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced.
>
> So that was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately. So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen:
>
> At some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository is unable to be compacted. Since this is not happening, it will not mark any of the Resource Claims as destructible. This would mean that the Content Repository does not get cleaned up. So now you've got a full Content Repository and it's unable to clean up after itself, because no Resource Claims are getting marked as destructible.
>
> So to prove or disprove this theory, there are a few things that you can look at:
>
> Do you see the following anywhere in your logs: Unable to checkpoint FlowFile Repository
>
> If you add the following to your conf/logback.xml:
> <logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />
> Then that should allow you to see a DEBUG-level log message every time that a Resource Claim is marked destructible and every time that the Content Repository requests the collection of Destructible Claims ("Drained 100 destructable claims" for instance).
>
> Any of the logs related to those statements should be very valuable in determining what's going on.
>
> Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!
>
> -Mark
>
>
> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com> wrote:
>
> Here is what I have figured out so far.
>
> The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232
>
> I'm not totally sure which one of those is the one that should be cleaning things up.
> It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit), if those methods fail once, they will stop running forever. Also, if the thread got stuck, it wouldn't launch a new one.
>
> I then hoped I would go into the logs, see a failure, and use it to figure out the issue.
>
> What I'm seeing instead is things like this, which comes from BinDestructableClaims:
> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis
> Those messages are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately there isn't an error message I can find that's killing these.
>
> At nifi startup, I see messages like this, which come from something (not sure what yet) calling the cleanup() method on FileSystemRepository:
> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file
> I never see those after the initial cleanup that happens on restart.
>
> I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, these were always all parked.
>
> Attached one of the thread dumps.
>
> Thanks,
> Alan
>
>
> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com> wrote:
>
> Alan,
>
> It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine if there is something going on that is preventing the cleanup from happening.
>
> Thanks
> -Mark
>
>
> ________________________________
> From: Alan Jackoway <al...@cloudera.com>
> Sent: Sunday, December 11, 2016 11:11 AM
> To: dev@nifi.apache.org
> Subject: Re: Content Repository Cleanup
>
> This just filled up again even with nifi.content.repository.archive.enabled=false.
>
> On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB.
>
> Is there a property I can set to make it clean things up more frequently? I expected that once I turned archive enabled off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far the only way I have found to reliably get nifi to clear out the content repository is to restart it.
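The ScheduledThreadPoolExecutor.scheduleWithFixedDelay behavior that Alan and Mark discuss above can be seen in a small standalone sketch. This is plain JDK code, not the actual NiFi classes; the class and task names are invented for the demo. A task that throws is silently never rescheduled, while one wrapped in a catch (Throwable) keeps running:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class FixedDelayFailureDemo {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);
            AtomicInteger unguardedRuns = new AtomicInteger();
            AtomicInteger guardedRuns = new AtomicInteger();

            // Unguarded task: the third run throws, so the executor silently
            // suppresses every subsequent execution -- no log, no replacement task.
            executor.scheduleWithFixedDelay(() -> {
                if (unguardedRuns.incrementAndGet() == 3) {
                    throw new RuntimeException("simulated failure");
                }
            }, 0, 100, TimeUnit.MILLISECONDS);

            // Guarded task: the catch (Throwable) keeps the Runnable from ever
            // completing exceptionally, so it keeps getting rescheduled.
            executor.scheduleWithFixedDelay(() -> {
                try {
                    guardedRuns.incrementAndGet();
                    if (guardedRuns.get() % 3 == 0) {
                        throw new RuntimeException("simulated failure");
                    }
                } catch (Throwable t) {
                    // swallow and keep going
                }
            }, 0, 100, TimeUnit.MILLISECONDS);

            Thread.sleep(1000);
            System.out.println("unguarded runs: " + unguardedRuns.get()); // stops at 3
            System.out.println("guarded runs:   " + guardedRuns.get());   // keeps climbing
            executor.shutdownNow();
        }
    }

Running it for a second or so shows the unguarded counter stop at 3 while the guarded one keeps climbing, which is why Mark notes above that the NiFi cleanup Runnables being wrapped in try/catch (Throwable) rules out this failure mode.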
> Our version string is the following, if that interests you:
> 11/26/2016 04:39:37 PST
> Tagged nifi-1.1.0-RC2
> From ${buildRevision} on branch ${buildBranch}
>
> Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.
>
> Thanks,
> Alan
>
> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com> wrote:
>
> > No, in my scenario a server restart would not affect the content repository size.
> >
> > On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com> wrote:
> >
> > > If we were in the situation Joe G described, should we expect that when we kill and restart nifi it would clean everything up? That behavior has been consistent every time - when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get within the archive policy again. That sounds less like the files are stuck and more like it failed trying.
> > >
> > > For now I just turned off archiving, since we don't really need it for this use case.
> > >
> > > I attached a jstack from last night's failure, which looks pretty boring to me.
> > >
> > > On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com> wrote:
> > >
> > >> The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time trouble was happening. We will try your suggestion to see if the situation improves.
> > >>
> > >> Thanks,
> > >> Alan
> > >>
> > >> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com> wrote:
> > >>
> > >>> Not sure if your scenario is related, but one of the NiFi devs recently explained to me that the files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.
> > >>>
> > >>> This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full.
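The packing behavior Joe describes here (many small FlowFiles appended into a shared content claim file) is governed by two of the nifi.properties entries that appear in the excerpt at the bottom of this thread; the values below are taken from that excerpt:

    # Maximum amount of content appended into a single content claim file,
    # and maximum number of FlowFiles that may share one claim.
    nifi.content.claim.max.appendable.size=10 MB
    nifi.content.claim.max.flow.files=100

With those values, up to 100 small FlowFiles can share one file on disk, and that file cannot be reclaimed until every FlowFile referencing it has left the flow.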
> > >>> On a hunch, I tried the following strategy:
> > >>> - MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
> > >>> - MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
> > >>> - PutFile into a directory
> > >>> - GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
> > >>> - UnpackContent (untar them)
> > >>> - UnpackContent (turn them back into flow files with the original attributes)
> > >>> - Then do the processing they were originally designed for
> > >>>
> > >>> This had the effect of very quickly reducing the size of my content repository to very nearly the actual size I saw reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.
> > >>>
> > >>> Hope this helps.
> > >>> Joe
> > >>>
> > >>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > >>>
> > >>> > Alan,
> > >>> >
> > >>> > That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically would mean that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you then we need to dig deeper. If you could do screenshots or share log files and stack dumps around this time those would all be helpful. If the screenshots and such are too sensitive please just share as much as you can.
> > >>> >
> > >>> > Thanks
> > >>> > Joe
> > >>> >
> > >>> > On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <al...@cloudera.com> wrote:
> > >>> >
> > >>> > > One other note on this, when it came back up there were tons of messages like this:
> > >>> > >
> > >>> > > 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file
> > >>> > >
> > >>> > > I haven't dug into what that means.
> > >>> > > Alan
> > >>> > >
> > >>> > > On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <al...@cloudera.com> wrote:
> > >>> > >
> > >>> > >> Hello,
> > >>> > >>
> > >>> > >> We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again.
> > >>> > >>
> > >>> > >> We are running nifi 1.1 for this. Our nifi.properties looked like this:
> > >>> > >>
> > >>> > >> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> > >>> > >> nifi.content.claim.max.appendable.size=10 MB
> > >>> > >> nifi.content.claim.max.flow.files=100
> > >>> > >> nifi.content.repository.directory.default=./content_repository
> > >>> > >> nifi.content.repository.archive.max.retention.period=12 hours
> > >>> > >> nifi.content.repository.archive.max.usage.percentage=50%
> > >>> > >> nifi.content.repository.archive.enabled=true
> > >>> > >> nifi.content.repository.always.sync=false
> > >>> > >>
> > >>> > >> I just bumped retention period down to 2 hours, but should max usage percentage protect us from using 100% of the disk?
> > >>> > >>
> > >>> > >> Unfortunately we didn't get jstacks on either failure. If it hits 100% again I will make sure to get that.
> > >>> > >>
> > >>> > >> Thanks,
> > >>> > >> Alan
> > >>> > >>
> > >>>
> > >>> --
> > >>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
> > >>
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>
> <thread-dump.txt>