We haven't let the disk hit 100% in a while, but it's been crossing 90%. We haven't seen the "Unable to checkpoint" message in the last 24 hours.
$ ulimit -Hn
4096
$ ulimit -Sn
1024

I will work on tracking a specific file next.

On Wed, Dec 14, 2016 at 8:17 PM, Alan Jackoway <al...@cloudera.com> wrote:

At first I thought that the drained messages always said 0, but that's not right. What should the total number of claims drained be? The number of flowfiles that made it through the system? If so, I think our number is low:

$ grep "StandardResourceClaimManager Drained" nifi-app_2016-12-14* | grep -v "Drained 0" | awk '{sum += $9} END {print sum}'
25296

I'm not sure how to get the count of flowfiles that moved through, but I suspect that's low by an order of magnitude. That instance of nifi has handled 150k files in the last 6 hours, most of which went through a number of processors and transformations.

Should the number of drained claims correspond to the number of flow files that moved through the system?

Alan

On Wed, Dec 14, 2016 at 6:59 PM, Alan Jackoway <al...@cloudera.com> wrote:

Some updates:
* We fixed the issue with missing transfer relationships, and this did not go away.
* We saw this a few minutes ago when the queue was at 0.

What should I be looking for in the logs to figure out the issue?

Thanks,
Alan

On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com> wrote:

In case this is interesting, I think this started getting bad when we started hitting an error where some of our files were not given a transfer relationship. Maybe some combination of not giving flow files a relationship and the subsequent penalization is causing the problem.

On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com> wrote:

Everything is at the default locations for these nifis.

On one of the two machines, I did find log messages like you suggested:

2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)

I added the logger, which apparently takes effect right away. What am I looking for in these logs?
I see a lot of stuff like:

2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []

What's puzzling to me is that both of these machines have > 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.

Thanks,
Alan

On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com> wrote:

Alan,

Thanks for the thread-dump and the in-depth analysis!

So in terms of the two tasks there, here's a quick explanation of what each does:

ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).

DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.

As you mentioned, in the Executor, if the Runnable fails it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely within a try { ... } catch (Throwable t) {...}. So the method will never end Exceptionally. Also, the thread dump shows all of the threads created by that Thread Pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means that they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop or anything like that.
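To make the Executor behavior being discussed concrete, here is a minimal, self-contained Java sketch (illustrative only, not NiFi's code; the class and variable names are invented) showing that a task whose body swallows Throwable keeps getting rescheduled by scheduleWithFixedDelay, whereas an uncaught exception would silently suppress every subsequent execution:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedDelayBehavior {

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();

        // A task in the style described above: the body is wrapped in catch (Throwable),
        // so a failure is swallowed and the executor keeps rescheduling the task. If the
        // catch block were removed, the RuntimeException thrown on run 2 would escape and
        // scheduleWithFixedDelay would silently suppress all later executions.
        executor.scheduleWithFixedDelay(() -> {
            try {
                int run = runs.incrementAndGet();
                if (run == 2) {
                    throw new RuntimeException("simulated failure on run " + run);
                }
                System.out.println("run " + run + " completed normally");
            } catch (final Throwable t) {
                System.out.println("caught and swallowed: " + t.getMessage());
            }
        }, 0, 200, TimeUnit.MILLISECONDS);

        Thread.sleep(1500);   // long enough to see runs continue after the simulated failure
        executor.shutdown();
    }
}

Run as-is, it prints the "caught and swallowed" line for run 2 and then keeps going; with the try/catch removed, output stops after run 1 with no error reported anywhere, which matches the silent-stop behavior the ScheduledThreadPoolExecutor docs describe.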
Now, with all of that being said, I have a theory as to what could perhaps be happening :)

From the configuration that you listed below, it shows that the content repository is located at ./content_repository, which is the default. Is the FlowFile Repository also located at the default location of ./flowfile_repository? The reason that I ask is this:

When I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done to ensure that we can recover if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; as a result, on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up having data loss. This method of only deleting Resource Claims after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced.

So that was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately. So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen:

At some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository is unable to be compacted. Since this is not happening, it will not mark any of the Resource Claims as destructible. This would mean that the Content Repository does not get cleaned up. So now you've got a full Content Repository and it's unable to clean up after itself, because no Resource Claims are getting marked as destructible.

So to prove or disprove this theory, there are a few things that you can look at:

Do you see the following anywhere in your logs: "Unable to checkpoint FlowFile Repository"?

If you add the following to your conf/logback.xml:

<logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />

then that should allow you to see a DEBUG-level log message every time that a Resource Claim is marked destructible and every time that the Content Repository requests the collection of Destructible Claims ("Drained 100 destructable claims", for instance).

Any of the logs related to those statements should be very valuable in determining what's going on.

Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!

-Mark
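A minimal, self-contained sketch of the ordering described above, using hypothetical class and method names rather than NiFi's real ones: claims released by FlowFiles only become destructible after a FlowFile Repository checkpoint succeeds, so a checkpoint that fails with "No space left on device" leaves the content repository with nothing it is allowed to delete.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CheckpointOrderingSketch {

    private final List<String> awaitingCheckpoint = new ArrayList<>();
    private final List<String> destructible = new ArrayList<>();
    private final boolean diskFull;

    CheckpointOrderingSketch(final boolean diskFull) {
        this.diskFull = diskFull;
    }

    // Called when no FlowFile references the claim any more; it is NOT yet deletable.
    void claimNoLongerReferenced(final String claimId) {
        awaitingCheckpoint.add(claimId);
    }

    // Stands in for the periodic checkpoint; only a successful, durable checkpoint
    // hands claims over for destruction.
    void checkpointFlowFileRepository() throws IOException {
        if (diskFull) {
            throw new IOException("No space left on device");
        }
        destructible.addAll(awaitingCheckpoint);
        awaitingCheckpoint.clear();
    }

    public static void main(String[] args) {
        CheckpointOrderingSketch repo = new CheckpointOrderingSketch(true);
        repo.claimNoLongerReferenced("1481555893660-3174");
        repo.claimNoLongerReferenced("1481555922818-3275");
        try {
            repo.checkpointFlowFileRepository();
        } catch (IOException e) {
            System.out.println("Unable to checkpoint FlowFile Repository: " + e.getMessage());
        }
        // Prints 0: while checkpoints keep failing, no claims are ever handed to the
        // content repository for deletion, so its disk usage can only grow.
        System.out.println("destructible claims drained: " + repo.destructible.size());
    }
}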
On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com> wrote:

Here is what I have figured out so far.

The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232

I'm not totally sure which one of those is the one that should be cleaning things up. It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit), if those methods fail once, they will stop running forever. Also, if the thread got stuck, it wouldn't launch a new one.

I then hoped I would go into the logs, see a failure, and use it to figure out the issue.

What I'm seeing instead is things like this, which come from BinDestructableClaims:

2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis

These are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately there isn't an error message I can find that's killing these.

At nifi startup, I see messages like this, which come from something (not sure what yet) calling the cleanup() method on FileSystemRepository:

2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file

I never see those after the initial cleanup that happens on restart.

I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, these were always all parked.

Attached one of the thread dumps.

Thanks,
Alan

On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com> wrote:

Alan,

It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine if there is something going on that is preventing the cleanup from happening.
Thanks
-Mark

________________________________
From: Alan Jackoway <al...@cloudera.com>
Sent: Sunday, December 11, 2016 11:11 AM
To: dev@nifi.apache.org
Subject: Re: Content Repository Cleanup

This just filled up again even with nifi.content.repository.archive.enabled=false.

On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB.

Is there a property I can set to make it clean things up more frequently? I expected that once I turned archive enabled off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far the only way I have found to reliably get nifi to clear out the content repository is to restart it.

Our version string is the following, if that interests you:
11/26/2016 04:39:37 PST
Tagged nifi-1.1.0-RC2
From ${buildRevision} on branch ${buildBranch}

Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.

Thanks,
Alan

On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com> wrote:

No, in my scenario a server restart would not affect the content repository size.

On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com> wrote:

If we were in the situation Joe G described, should we expect that when we kill and restart nifi it would clean everything up? That behavior has been consistent every time - when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get within the archive policy again. That sounds less like the files are stuck and more like it failed trying.

For now I just turned off archiving, since we don't really need it for this use case.

I attached a jstack from last night's failure, which looks pretty boring to me.

On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com> wrote:

The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time trouble was happening. We will try your suggestion to see if the situation improves.
Thanks,
Alan

On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com> wrote:

Not sure if your scenario is related, but one of the NiFi devs recently explained to me that the files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.

This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full. On a hunch, I tried the following strategy:
- MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
- MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
- PutFile into a directory
- GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
- UnpackContent (untar them)
- UnpackContent (turn them back into flow files with the original attributes)
- Then do the processing they were originally designed for

This had the effect of very quickly reducing the size of my content repository to very nearly the actual size I saw reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.

Hope this helps.
Joe
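A toy model of the behavior Joe describes above, with hypothetical names rather than NiFi's implementation: the content of several FlowFiles is appended into one claim file on disk, and that file is only deletable once its claimant count drops to zero, so a backlog of tiny FlowFiles can pin the bytes of large FlowFiles that have already left the flow.

import java.util.HashMap;
import java.util.Map;

public class SharedClaimSketch {

    public static void main(String[] args) {
        Map<String, Integer> claimantCount = new HashMap<>();
        String claimFile = "content_repository/102/1481555893660-3174";

        // A large FlowFile and many tiny FlowFiles written into the same claim file.
        claimantCount.merge(claimFile, 1, Integer::sum);
        claimantCount.merge(claimFile, 10_000, Integer::sum);

        // The large FlowFile completes and leaves the flow...
        claimantCount.merge(claimFile, -1, Integer::sum);

        // ...but the claim file, including the large FlowFile's bytes, stays on disk
        // until the remaining claimants are gone as well.
        int remaining = claimantCount.get(claimFile);
        System.out.println("claimants remaining: " + remaining
                + ", file deletable: " + (remaining == 0));
    }
}

Draining the tiny-file backlog out of the flow (for example with the MergeContent/PutFile workaround above) releases those remaining claimants, which is consistent with the sharp drop in content repository size Joe observed.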
On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:

Alan,

That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically would mean that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you then we need to dig deeper. If you could do screenshots or share log files and stack dumps around this time those would all be helpful. If the screenshots and such are too sensitive please just share as much as you can.

Thanks
Joe

On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <al...@cloudera.com> wrote:

One other note on this, when it came back up there were tons of messages like this:

2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file

I haven't dug into what that means.
Alan

On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <al...@cloudera.com> wrote:

Hello,

We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again.

We are running nifi 1.1 for this. Our nifi.properties looked like this:

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false

I just bumped the retention period down to 2 hours, but should the max usage percentage protect us from using 100% of the disk?

Unfortunately we didn't get jstacks on either failure. If it hits 100% again I will make sure to get that.

Thanks,
Alan