Some updates:
* We fixed the issue with missing transfer relationships, but the problem did not go away.
* We saw this again a few minutes ago while the queue was at 0.
What should I be looking for in the logs to figure out the issue?

Thanks,
Alan

On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <[email protected]> wrote:

> In case this is interesting: I think this started getting bad when we started hitting an error where some of our files were not given a transfer relationship. Maybe some combination of not giving flow files a relationship and the subsequent penalization is causing the problem.
>
> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <[email protected]> wrote:
>
>> Everything is at the default locations for these nifis.
>>
>> On one of the two machines, I did find log messages like you suggested:
>> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)
>>
>> I added the logger, which apparently takes effect right away. What am I looking for in these logs? I see a lot of entries like:
>> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
>> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
>> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
>> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
>> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
>> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
>> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []
>>
>> What's puzzling to me is that both of these machines have > 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.
>>
>> Thanks,
>> Alan
>>
>> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <[email protected]> wrote:
>>
>>> Alan,
>>>
>>> Thanks for the thread dump and the in-depth analysis!
>>>
>>> So, in terms of the two tasks there, here's a quick explanation of what each does:
>>> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).
>>> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.
>>>
>>> As you mentioned, if a Runnable given to the Executor fails, it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely in a try { ... } catch (Throwable t) { ... }, so the run method can never end exceptionally. Also, the thread dump shows all of the threads created by that thread pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop.
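
To convince myself of that scheduleWithFixedDelay behavior, here is a minimal, self-contained sketch (my own demo code, nothing from NiFi) showing that a task which throws an uncaught exception is never rescheduled, while a task that wraps its body in try/catch (Throwable) keeps running:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

        // Unguarded task: throws on its third run. Per the
        // ScheduledThreadPoolExecutor javadoc, the uncaught exception
        // suppresses all subsequent executions of this task.
        final int[] runs = {0};
        executor.scheduleWithFixedDelay(() -> {
            runs[0]++;
            System.out.println("unguarded run " + runs[0]);
            if (runs[0] == 3) {
                throw new RuntimeException("boom");
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        // Guarded task: wraps its body in try/catch (Throwable), as the NiFi
        // cleanup Runnables do, so one failed run does not cancel future runs.
        executor.scheduleWithFixedDelay(() -> {
            try {
                throw new RuntimeException("boom");
            } catch (Throwable t) {
                System.out.println("guarded run survived: " + t.getMessage());
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);
        executor.shutdownNow();
    }
}

Run it and the unguarded task stops after its third execution, while the guarded one keeps printing until shutdown.
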
>>> Now, with all of that being said, I have a theory as to what could perhaps be happening :)
>>>
>>> The configuration that you listed below shows that the content repository is located at ./content_repository, which is the default. Is the FlowFile Repository also located at the default location of ./flowfile_repository? The reason that I ask is this:
>>>
>>> When I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done to ensure that we can recover if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B, and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up with data loss. Deleting Resource Claims only after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced.
>>>
>>> That was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately. So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen:
>>>
>>> At some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository cannot be compacted. Since that is not happening, it will not mark any of the Resource Claims as destructible, which means that the Content Repository does not get cleaned up. So now you've got a full Content Repository that is unable to clean up after itself, because no Resource Claims are getting marked as destructible.
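
To make sure I follow the ordering Mark lays out, here is a toy model I sketched (hypothetical names, not NiFi's actual implementation): a claimant count dropping to zero only makes a claim a candidate, and nothing becomes destructible until a successful checkpoint-plus-fsync hands the candidates over. A failing checkpoint would therefore leave the content repository growing even while claimant counts keep hitting zero:

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the resource-claim lifecycle described above; illustrative only.
public class ClaimLifecycleModel {
    private final Map<String, AtomicInteger> claimantCounts = new ConcurrentHashMap<>();
    private final Queue<String> pendingDestruction = new ConcurrentLinkedQueue<>();
    private final Queue<String> destructible = new ConcurrentLinkedQueue<>();

    public void incrementClaimants(String claimId) {
        claimantCounts.computeIfAbsent(claimId, id -> new AtomicInteger()).incrementAndGet();
    }

    // Mirrors the "Decrementing claimant count ... to 0" DEBUG messages: a zero
    // count makes the claim a candidate, but it is not yet destructible.
    public void decrementClaimants(String claimId) {
        AtomicInteger count = claimantCounts.get(claimId);
        if (count != null && count.decrementAndGet() == 0) {
            pendingDestruction.add(claimId);
        }
    }

    // Runs when the FlowFile Repository checkpoints (every 2 minutes by default).
    // Only after the write-ahead log is rolled over and fsync'ed do candidates
    // become destructible, so if the checkpoint fails ("No space left on
    // device"), nothing is ever handed to the content repository for deletion.
    public void onFlowFileRepositoryCheckpoint(boolean checkpointSucceeded) {
        if (!checkpointSucceeded) {
            return; // candidates stay pending; the content repository keeps growing
        }
        String claimId;
        while ((claimId = pendingDestruction.poll()) != null) {
            destructible.add(claimId);
        }
    }

    // Mirrors the "Drained N destructable claims" DEBUG message.
    public int drainDestructibleClaims() {
        int drained = 0;
        while (destructible.poll() != null) {
            drained++;
        }
        return drained;
    }

    public static void main(String[] args) {
        ClaimLifecycleModel model = new ClaimLifecycleModel();
        model.incrementClaimants("1481555893660-3174");
        model.decrementClaimants("1481555893660-3174"); // count hits 0 -> candidate
        model.onFlowFileRepositoryCheckpoint(false);    // failed checkpoint
        System.out.println("after failed checkpoint: drained " + model.drainDestructibleClaims());
        model.onFlowFileRepositoryCheckpoint(true);     // successful checkpoint + fsync
        System.out.println("after successful checkpoint: drained " + model.drainDestructibleClaims());
    }
}

In this model, the repeated "Drained 0 destructable claims" lines we see are exactly what a stuck checkpoint would produce.
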
>>> So to prove or disprove this theory, there are a few things that you can look at:
>>>
>>> Do you see the following anywhere in your logs: "Unable to checkpoint FlowFile Repository"?
>>>
>>> If you add the following to your conf/logback.xml:
>>> <logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />
>>> then you should see a DEBUG-level log message every time a Resource Claim is marked destructible, and every time the Content Repository requests the collection of destructible claims ("Drained 100 destructable claims", for instance).
>>>
>>> Any of the logs related to those statements should be very valuable in determining what's going on.
>>>
>>> Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!
>>>
>>> -Mark
>>>
>>> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <[email protected]> wrote:
>>>
>>> Here is what I have figured out so far.
>>>
>>> The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232
>>>
>>> I'm not totally sure which of those is the one that should be cleaning things up. It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit), if those tasks fail once, they will stop running forever. Also, if the thread got stuck, a new one would not be launched.
>>>
>>> I then hoped to go into the logs, see a failure, and use it to figure out the issue.
>>>
>>> What I'm seeing instead is messages like this, which come from BinDestructableClaims:
>>> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis
>>> They are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately, I can't find an error message that kills them.
>>>
>>> At nifi startup, I see messages like this, which come from something (I'm not sure what yet) calling the cleanup() method on FileSystemRepository:
>>> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file
>>> I never see those after the initial cleanup that happens on restart.
>>>
>>> I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this, and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, those were always all parked.
>>>
>>> Attached one of the thread dumps.
>>>
>>> Thanks,
>>> Alan
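
For anyone repeating this analysis, here is a small helper I sketched to pull thread states out of a dump file (my own code, not a NiFi utility; it assumes the usual HotSpot dump layout of a quoted thread-name header line followed by a java.lang.Thread.State line, which is what bin/nifi.sh dump produced for me):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Scan a thread dump and print the state of every thread whose name starts
// with the given prefix, e.g. "FileSystemRepository Workers".
public class ThreadDumpGrep {
    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "thread-dump.txt";
        String prefix = args.length > 1 ? args[1] : "FileSystemRepository Workers";

        List<String> lines = Files.readAllLines(Paths.get(file));
        String currentThread = null;
        for (String line : lines) {
            if (line.startsWith("\"")) {
                // Header line, e.g.: "FileSystemRepository Workers Thread-1" daemon ...
                int end = line.indexOf('"', 1);
                String name = end > 0 ? line.substring(1, end) : null;
                currentThread = (name != null && name.startsWith(prefix)) ? name : null;
            } else if (currentThread != null && line.contains("java.lang.Thread.State:")) {
                String state = line.substring(line.indexOf(':') + 1).trim();
                System.out.println(currentThread + " -> " + state);
                currentThread = null;
            }
        }
    }
}

Across all of my dumps, every matching thread came back parked (WAITING or TIMED_WAITING).
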
>>> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <[email protected]> wrote:
>>>
>>> Alan,
>>>
>>> It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine whether something is preventing the cleanup from happening.
>>>
>>> Thanks
>>> -Mark
>>>
>>> ________________________________
>>> From: Alan Jackoway <[email protected]>
>>> Sent: Sunday, December 11, 2016 11:11 AM
>>> To: [email protected]
>>> Subject: Re: Content Repository Cleanup
>>>
>>> This just filled up again, even with nifi.content.repository.archive.enabled=false.
>>>
>>> On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB.
>>>
>>> Is there a property I can set to make it clean things up more frequently? I expected that once I turned archiving off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far, the only way I have found to reliably get nifi to clear out the content repository is to restart it.
>>>
>>> Our version string is the following, if that interests you:
>>> 11/26/2016 04:39:37 PST
>>> Tagged nifi-1.1.0-RC2
>>> From ${buildRevision} on branch ${buildBranch}
>>>
>>> Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.
>>>
>>> Thanks,
>>> Alan
>>>
>>> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <[email protected]> wrote:
>>> >
>>> > No, in my scenario a server restart would not affect the content repository size.
>>> >
>>> > On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <[email protected]> wrote:
>>> > >
>>> > > If we were in the situation Joe G described, should we expect that killing and restarting nifi would clean everything up? That behavior has been consistent every time: when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get back within the archive policy. That sounds less like the files are stuck and more like it failed trying.
>>> > >
>>> > > For now I just turned off archiving, since we don't really need it for this use case.
>>> > >
>>> > > I attached a jstack from last night's failure, which looks pretty boring to me.
>>> > >
>>> > > On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <[email protected]> wrote:
>>> > >
>>> > >> The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time the trouble was happening. We will try your suggestion to see if the situation improves.
>>> > >>
>>> > >> Thanks,
>>> > >> Alan
>>> > >>
>>> > >> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <[email protected]> wrote:
>>> > >>
>>> > >>> Not sure if your scenario is related, but one of the NiFi devs recently explained to me that files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.
>>> > >>>
>>> > >>> This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full. On a hunch, I tried the following strategy:
>>> > >>> - MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
>>> > >>> - MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
>>> > >>> - PutFile into a directory
>>> > >>> - GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
>>> > >>> - UnpackContent (untar them)
>>> > >>> - UnpackContent (turn them back into flow files with the original attributes)
>>> > >>> - Then do the processing they were originally designed for
>>> > >>>
>>> > >>> This very quickly reduced the size of my content repository to very nearly the actual size reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.
>>> > >>>
>>> > >>> Hope this helps.
>>> > >>> Joe
>>> > >>>
>>> > >>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <[email protected]> wrote:
>>> > >>>
>>> > >>> > Alan,
>>> > >>> >
>>> > >>> > That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically means that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you, then we need to dig deeper. If you could share screenshots, log files, and stack dumps from around this time, those would all be helpful. If the screenshots and such are too sensitive, please just share as much as you can.
>>> > >>> >
>>> > >>> > Thanks
>>> > >>> > Joe
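
To picture the append behavior Joe describes, here is a toy illustration (hypothetical sizes and class names, not NiFi code): one claim file holding a large payload whose flowfile already left the flow, plus a few tiny flowfiles that are still queued, cannot be reclaimed until the tiny ones leave, so the reported queue size and the on-disk size diverge:

import java.util.ArrayList;
import java.util.List;

// Toy model: several flowfiles' content is appended into one claim file, and
// the file is only reclaimable once *no* live flowfile references any part of it.
public class ClaimPinningDemo {
    static class ClaimFile {
        final List<Long> liveContentSizes = new ArrayList<>();
        long totalBytesOnDisk;
    }

    public static void main(String[] args) {
        ClaimFile claim = new ClaimFile();
        // A 1 GB payload that already left the flow, appended together with
        // three 1 KB flowfiles that are still queued (hypothetical sizes).
        claim.totalBytesOnDisk = 1_000_000_000L + 3 * 1_000L;
        for (int i = 0; i < 3; i++) {
            claim.liveContentSizes.add(1_000L);
        }

        long queuedBytes = claim.liveContentSizes.stream().mapToLong(Long::longValue).sum();
        // The whole file stays on disk as long as any live reference remains.
        long diskBytes = claim.liveContentSizes.isEmpty() ? 0 : claim.totalBytesOnDisk;

        System.out.printf("queue reports %d bytes; disk holds %d bytes%n", queuedBytes, diskBytes);
    }
}

With millions of tiny flowfiles spread across claim files, that gap is how a ~16 GB queue can pin hundreds of gigabytes on disk.
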
>>> > >>> >
>>> > >>> > On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > > One other note on this: when it came back up, there were tons of messages like this:
>>> > >>> > >
>>> > >>> > > 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file
>>> > >>> > >
>>> > >>> > > I haven't dug into what that means.
>>> > >>> > > Alan
>>> > >>> > >
>>> > >>> > > On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > >> Hello,
>>> > >>> > >>
>>> > >>> > >> We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again.
>>> > >>> > >>
>>> > >>> > >> We are running nifi 1.1 for this. Our nifi.properties looked like this:
>>> > >>> > >>
>>> > >>> > >> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>> > >>> > >> nifi.content.claim.max.appendable.size=10 MB
>>> > >>> > >> nifi.content.claim.max.flow.files=100
>>> > >>> > >> nifi.content.repository.directory.default=./content_repository
>>> > >>> > >> nifi.content.repository.archive.max.retention.period=12 hours
>>> > >>> > >> nifi.content.repository.archive.max.usage.percentage=50%
>>> > >>> > >> nifi.content.repository.archive.enabled=true
>>> > >>> > >> nifi.content.repository.always.sync=false
>>> > >>> > >>
>>> > >>> > >> I just bumped the retention period down to 2 hours, but shouldn't the max usage percentage protect us from using 100% of the disk?
>>> > >>> > >>
>>> > >>> > >> Unfortunately, we didn't get jstacks on either failure. If it hits 100% again, I will make sure to get one.
>>> > >>> > >>
>>> > >>> > >> Thanks,
>>> > >>> > >> Alan
>>>
>>> <thread-dump.txt>
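
P.S. While we keep debugging, here is a quick standalone check I put together (my own sketch, not a NiFi utility; I'm assuming the archive threshold is judged against the total space of the volume holding the repository) to compare the content repository's size with that 50% setting:

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Compare the content repository's on-disk size against the
// nifi.content.repository.archive.max.usage.percentage threshold.
public class RepoUsageCheck {
    public static void main(String[] args) throws IOException {
        Path repo = Paths.get(args.length > 0 ? args[0] : "./content_repository");
        double maxUsagePercentage = 50.0; // from nifi.properties

        long repoBytes = sizeOf(repo);
        FileStore store = Files.getFileStore(repo);
        double usedPct = 100.0 * repoBytes / store.getTotalSpace();

        System.out.printf("content_repository: %,d bytes = %.1f%% of volume (threshold %.1f%%)%n",
                repoBytes, usedPct, maxUsagePercentage);
    }

    private static long sizeOf(Path dir) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total[0] += attrs.size();
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }
}

If this reports well over 50% while the flow only shows ~16 GB queued, that is more evidence that the cleanup handoff, not the archive policy, is what's broken.
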
