I'm going to move future conversation to that jira, but I've tried a few things now and all of them seem to strand files in the content repository. My question, reworded, is:
How do you clean up new flow files that a processor created when the session rolls back / penalizes?

On Thu, Dec 15, 2016 at 11:24 AM, Alan Jackoway <al...@cloudera.com> wrote:

> Update: session.remove(newFiles) does not work. I filed
> https://issues.apache.org/jira/browse/NIFI-3205
>
> On Thu, Dec 15, 2016 at 11:05 AM, Alan Jackoway <al...@cloudera.com> wrote:
>
>> I am getting the successfully checkpointed message.
>>
>> I think I figured this out. Now we have to decide whether it's an issue in nifi or an issue in our code.
>>
>> This flow has a process that takes large zip files, unzips them, and does some processing of the files. I noticed that disk usage seems to go up fastest when there is a large file that is failing in the middle of one of these steps. I then suspected that something about the way we were creating new flow files out of the zips was the problem.
>>
>> I simulated what we were doing with the following processor in a new nifi 1.1. The processor takes an input file, copies it 5 times (to simulate unzip / process), then throws a runtime exception. I then wired a GenerateFlowFile of 100KB to it. I noticed the following characteristics:
>> * Each time it ran, the size of the content repository went up exactly 500KB.
>> * When I restarted the nifi, I got the messages about unknown files in the FileSystemRepository.
>>
>> So basically what this boils down to is: who is responsible for removing files from a session when a failure occurs? Should we be doing that (I will test next whether calling session.remove before the error fixes the problem), or should the session keep track of new flow files that it created? We assumed the session would do so, because the session yells at us if we fail to give a transfer relationship for one of the files.
>>
>> Thanks for all the help with this. I think we are closing in on the point where I have either a fix or a bug filed or both.
>>
>> Test processor I used:
>>
>> // Copyright 2016 (c) Cloudera
>> package com.cloudera.edh.nifi.processors.bundles;
>>
>> import com.google.common.collect.Lists;
>>
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.io.OutputStream;
>> import java.util.List;
>>
>> import org.apache.nifi.annotation.behavior.InputRequirement;
>> import org.apache.nifi.annotation.behavior.InputRequirement.Requirement;
>> import org.apache.nifi.flowfile.FlowFile;
>> import org.apache.nifi.processor.AbstractProcessor;
>> import org.apache.nifi.processor.ProcessContext;
>> import org.apache.nifi.processor.ProcessSession;
>> import org.apache.nifi.processor.exception.ProcessException;
>> import org.apache.nifi.processor.io.InputStreamCallback;
>> import org.apache.nifi.processor.io.OutputStreamCallback;
>> import org.apache.nifi.stream.io.StreamUtils;
>>
>> /**
>>  * Makes 5 copies of an incoming file, then fails and rolls back.
>>  */
>> @InputRequirement(value = Requirement.INPUT_REQUIRED)
>> public class CopyAndFail extends AbstractProcessor {
>>
>>     @Override
>>     public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
>>         FlowFile inputFile = session.get();
>>         if (inputFile == null) {
>>             context.yield();
>>             return;
>>         }
>>         final List<FlowFile> newFiles = Lists.newArrayList();
>>
>>         // Copy the file 5 times (simulates us opening a zip file and unpacking its contents)
>>         for (int i = 0; i < 5; i++) {
>>             session.read(inputFile, new InputStreamCallback() {
>>                 @Override
>>                 public void process(InputStream inputStream) throws IOException {
>>                     FlowFile ff = session.create(inputFile);
>>                     ff = session.write(ff, new OutputStreamCallback() {
>>                         @Override
>>                         public void process(final OutputStream out) throws IOException {
>>                             StreamUtils.copy(inputStream, out);
>>                         }
>>                     });
>>                     newFiles.add(ff);
>>                 }
>>             });
>>         }
>>
>>         // THIS IS WHERE I WILL PUT session.remove TO VERIFY THAT WORKS
>>
>>         // Simulate an error handling some file in the zip after unpacking the rest
>>         throw new RuntimeException();
>>     }
>> }
>>
>> On Wed, Dec 14, 2016 at 9:23 PM, Mark Payne <marka...@hotmail.com> wrote:
>>
>>> I'd be very curious to see if changing the limits addresses the issue. The OOME can certainly be an issue as well. Once that gets thrown anywhere in the JVM, it's hard to vouch for the stability of the JVM at all.
>>>
>>> Seeing the claimant count drop to 0, then back up to 1, 2, and down to 1, 0 again is pretty common. The fact that you didn't see it marked as destructible is interesting. Around that same time, are you seeing log messages indicating that the FlowFile repo is checkpointing? They would have the words "Successfully checkpointed FlowFile Repository". That should happen approximately every 2 minutes.
>>>
>>> On Dec 14, 2016, at 8:56 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>>
>>> I agree the limits sound low and will address that tomorrow.
>>>
>>> I'm not seeing FileNotFound or NoSuchFile.
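For reference, a minimal sketch of the remove-before-failure variant of the CopyAndFail test processor above. The class name and the rollback(true) call are hypothetical choices, and per the update at the top of this thread, session.remove(newFiles) did not actually reclaim the space; that is what NIFI-3205 tracks.

// Hypothetical variant of CopyAndFail: explicitly remove the FlowFiles this
// session created before failing. In practice this still stranded the content
// (NIFI-3205); it is shown here only to make the question above concrete.
package com.cloudera.edh.nifi.processors.bundles;

import com.google.common.collect.Lists;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.InputStreamCallback;
import org.apache.nifi.stream.io.StreamUtils;

public class CopyRemoveAndFail extends AbstractProcessor {

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile inputFile = session.get();
        if (inputFile == null) {
            context.yield();
            return;
        }
        final List<FlowFile> newFiles = Lists.newArrayList();
        try {
            for (int i = 0; i < 5; i++) {
                session.read(inputFile, new InputStreamCallback() {
                    @Override
                    public void process(InputStream in) throws IOException {
                        FlowFile ff = session.create(inputFile);
                        ff = session.write(ff, out -> StreamUtils.copy(in, out));
                        newFiles.add(ff);
                    }
                });
            }
            // Simulate an error partway through handling the zip
            throw new ProcessException("simulated failure");
        } catch (final ProcessException e) {
            // Drop the copies this session created, then roll back and penalize the input.
            // The expectation was that this would free the content claims; it did not.
            session.remove(newFiles);
            session.rollback(true);
        }
    }
}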
>>> >>> Here's an example file: >>> grep 1481763927251 logs/nifi-app.log >>> 2016-12-14 17:05:27,277 DEBUG [Timer-Driven Process Thread-36] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:05:27,357 DEBUG [Timer-Driven Process Thread-2] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 2 >>> 2016-12-14 17:05:27,684 DEBUG [Timer-Driven Process Thread-36] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:05:27,732 DEBUG [Timer-Driven Process Thread-2] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 0 >>> 2016-12-14 17:05:27,909 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:05:27,945 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 2 >>> 2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 0 >>> >>> This nifi-app.log covers a period when the nifi only handled two sets of >>> files for a total of maybe 10GB uncompressed. Content repository went >>> over >>> 100GB in that time. I checked a few content repository files, and they >>> all >>> had similar patterns - claims hit 0 twice - once around 17:05 and once >>> around 17:14, then nothing. I brought down the nifi around 17:30. >>> >>> During that time, we did have a processor hitting OutOfMemory while >>> unpacking a 1GB file. I'm adjusting the heap to try to make that succeed >>> in >>> case that was related. >>> >>> On Wed, Dec 14, 2016 at 8:32 PM, Mark Payne <marka...@hotmail.com >>> <mailto:marka...@hotmail.com>> wrote: >>> >>> OK, so these are generally the default values for most linux systems. >>> These are a little low, >>> though for what NiFi recommends and often needs. With these settings, you >>> can easily run >>> out of open file handles. When this happens, trying to access a file will >>> return a FileNotFoundException >>> even though the file exists and permissions all look good. As a result, >>> NiFi may be failing to >>> delete the data simply because it can't get an open file handle. >>> >>> The admin guide [1] explains the best practices for configuring these >>> settings. Generally, after updating >>> these settings, I think you have to logout of the machine and login again >>> for the changes to take effect. >>> Would recommend you update these settings and also search logs for >>> "FileNotFound" as well as >>> "NoSuchFile" and see if that hits anywhere. 
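A quick way to reproduce the symptom Mark describes: the sketch below (standalone Java, not NiFi code; the temp-file prefix is arbitrary) opens streams on a file that plainly exists, without closing them, until the process runs out of file handles, at which point the JDK raises FileNotFoundException even though the file is there and readable.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Exhausts the per-process open-file limit. On Linux the failure typically
// reads "java.io.FileNotFoundException: ... (Too many open files)".
public class ExhaustFileHandles {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fd-demo", ".tmp");
        f.deleteOnExit();
        List<FileInputStream> open = new ArrayList<>();
        try {
            while (true) {
                open.add(new FileInputStream(f)); // deliberately never closed
            }
        } catch (FileNotFoundException e) {
            System.out.println("Failed after " + open.size() + " open streams: " + e.getMessage());
        } finally {
            for (FileInputStream in : open) {
                in.close();
            }
        }
    }
}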
>>> >>> >>> [1] http://nifi.apache.org/docs/nifi-docs/html/administration- >>> guide.html#configuration-best-practices >>> >>> >>> On Dec 14, 2016, at 8:25 PM, Alan Jackoway <al...@cloudera.com<mailto: >>> al...@cloudera.com><mailto:ala >>> n...@cloudera.com<mailto:n...@cloudera.com>>> wrote: >>> >>> We haven't let the disk hit 100% in a while, but it's been crossing 90%. >>> We >>> haven't seen the "Unable to checkpoint" message in the last 24 hours. >>> >>> $ ulimit -Hn >>> 4096 >>> $ ulimit -Sn >>> 1024 >>> >>> I will work on tracking a specific file next. >>> >>> >>> On Wed, Dec 14, 2016 at 8:17 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com><mailto: >>> al...@cloudera.com<mailto:al...@cloudera.com>>> wrote: >>> >>> At first I thought that the drained messages always said 0, but that's >>> not >>> right. What should the total number of claims drained be? The number of >>> flowfiles that made it through the system? If so, I think our number is >>> low: >>> >>> $ grep "StandardResourceClaimManager Drained" nifi-app_2016-12-14* | >>> grep >>> -v "Drained 0" | awk '{sum += $9} END {print sum}' >>> 25296 >>> >>> I'm not sure how to get the count of flowfiles that moved through, but I >>> suspect that's low by an order of magnitude. That instance of nifi has >>> handled 150k files in the last 6 hours, most of which went through a >>> number >>> of processors and transformations. >>> >>> Should the number of drained claims correspond to the number of flow >>> files >>> that moved through the system? >>> Alan >>> >>> On Wed, Dec 14, 2016 at 6:59 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com><mailto: >>> al...@cloudera.com<mailto:al...@cloudera.com>>> wrote: >>> >>> Some updates: >>> * We fixed the issue with missing transfer relationships, and this did >>> not go away. >>> * We saw this a few minutes ago when the queue was at 0. >>> >>> What should I be looking for in the logs to figure out the issue? >>> >>> Thanks, >>> Alan >>> >>> On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com> >>> <mailto:al...@cloudera.com>> >>> wrote: >>> >>> In case this is interesting, I think this started getting bad when we >>> started hitting an error where some of our files were not given a >>> transfer >>> relationship. Maybe some combination of not giving flow files a >>> relationship and the subsequent penalization is causing the problem. >>> >>> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com> >>> <mailto:al...@cloudera.com>> >>> wrote: >>> >>> Everything is at the default locations for these nifis. >>> >>> On one of the two machines, I did find log messages like you suggested: >>> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] >>> o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile >>> Repository due to java.io.FileNotFoundException: >>> ./flowfile_repository/partition-14/3169.journal (No space left on >>> device) >>> >>> I added the logger, which apparently takes effect right away. What am I >>> looking for in this logs? 
I see a lot of stuff like: >>> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555893660-3174, container=default, >>> section=102] to 0 >>> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555922818-3275, container=default, >>> section=203] to 191 >>> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555880393-3151, container=default, >>> section=79] to 142 >>> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555872053-3146, container=default, >>> section=74] to 441 >>> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555893954-3178, container=default, >>> section=106] to 2 >>> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555893696-3175, container=default, >>> section=103] to 1 >>> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] >>> o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims >>> to [] >>> >>> What's puzzling to me is that both of these machines have > 100GB of >>> free space, and I have never seen the queued size go above 20GB. It seems >>> to me like it gets into a state where nothing is deleted long before it >>> runs out of disk space. >>> >>> Thanks, >>> Alan >>> >>> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com >>> <mailto:marka...@hotmail.com><mailto:m >>> arka...@hotmail.com<mailto:arka...@hotmail.com>>> >>> wrote: >>> >>> Alan, >>> >>> Thanks for the thread-dump and the in-depth analysis! >>> >>> So in terms of the two tasks there, here's a quick explanation of what >>> each does: >>> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which >>> maps to a file on disk) is no longer referenced >>> by any FlowFile, it can be either archived or destroyed (depending on >>> whether the property in nifi.properties has archiving >>> enabled). >>> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource >>> Claims that are archived have to eventually >>> age off. This task is responsible for ensuring that this happens. >>> >>> As you mentioned, in the Executor, if the Runnable fails it will stop >>> running forever, and if the thread gets stuck, another will >>> not be launched. Neither of these appears to be the case. I say this >>> because both of those Runnables are wrapped entirely >>> within a try { ... } catch (Throwable t) {...}. So the method will >>> never end Exceptionally. Also, the thread dump shows all of the >>> threads created by that Thread Pool (those whose names begin with >>> "FileSystemRepository Workers Thread-") in WAITING >>> or TIMED_WAITING state. This means that they are sitting in the >>> Executor waiting to be scheduled to do something else, >>> so they aren't stuck in any kind of infinite loop or anything like >>> that. 
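The wrapping Mark mentions is what keeps these tasks alive: ScheduledThreadPoolExecutor suppresses all further executions of a scheduleWithFixedDelay task once a run throws. A minimal demo of the difference (just the shape of the pattern, not NiFi's actual code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Two tasks on a fixed delay. The unguarded one throws on its first run and is
// never rescheduled; the guarded one catches Throwable inside run() and keeps firing.
public class FixedDelayGuardDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(2);
        AtomicInteger unguardedRuns = new AtomicInteger();
        AtomicInteger guardedRuns = new AtomicInteger();

        exec.scheduleWithFixedDelay(() -> {
            unguardedRuns.incrementAndGet();
            throw new RuntimeException("boom"); // suppresses all later executions of this task
        }, 0, 100, TimeUnit.MILLISECONDS);

        exec.scheduleWithFixedDelay(() -> {
            try {
                guardedRuns.incrementAndGet();
                throw new RuntimeException("boom");
            } catch (final Throwable t) {
                // swallowed, so the fixed-delay schedule survives
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);
        System.out.println("unguarded ran " + unguardedRuns + " time(s), guarded ran " + guardedRuns);
        exec.shutdownNow();
    }
}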
>>> >>> Now, with all of that being said, I have a theory as to what could >>> perhaps be happening :) >>> >>> From the configuration that you listed below, it shows that the >>> content repository is located at ./content_repository, which is >>> the default. Is the FlowFile Repository also located at the default >>> location of ./flowfile_repository? The reason that I ask is this: >>> >>> When I said above that a Resource Claim is marked destructible when no >>> more FlowFiles reference it, that was a bit of a >>> simplification. A more detailed explanation is this: when the FlowFile >>> Repository is checkpointed (this happens every 2 minutes >>> by default), its Write-Ahead Log is "rolled over" (or "checkpointed" >>> or "compacted" or however you like to refer to it). When this >>> happens, we do an fsync() to ensure that the data is stored safely on >>> disk. Only then do we actually mark a claim as destructible. >>> This is done in order to ensure that if there is a power outage and a >>> FlowFile Repository update wasn't completely flushed to disk, >>> that we can recover. For instance, if the content of a FlowFile >>> changes from Resource Claim A to Resource Claim B and as a result >>> we delete Resource Claim A and then lose power, it's possible that the >>> FlowFile Repository didn't flush that update to disk; as a result, >>> on restart, we may still have that FlowFile pointing to Resource Claim >>> A which is now deleted, so we would end up having data loss. >>> This method of only deleting Resource Claims after the FlowFile >>> Repository has been fsync'ed means that we know on restart that >>> Resource Claim A won't still be referenced. >>> >>> So that was probably a very wordy, verbose description of what happens >>> but I'm trying to make sure that I explain things adequately. >>> So with that background... if you are storing your FlowFile Repository >>> on the same volume as your Content Repository, the following >>> could happen: >>> >>> At some point in time, enough data is queued up in your flow for you >>> to run out of disk space. As a result, the FlowFile Repository is >>> unable to be compacted. Since this is not happening, it will not mark >>> any of the Resource Claims as destructible. This would mean that >>> the Content Repository does not get cleaned up. So now you've got a >>> full Content Repository and it's unable to clean up after itself, because >>> no Resource Claims are getting marked as destructible. >>> >>> So to prove or disprove this theory, there are a few things that you >>> can look at: >>> >>> Do you see the following anywhere in your logs: Unable to checkpoint >>> FlowFile Repository >>> >>> If you add the following to your conf/logback.xml: >>> <logger name="org.apache.nifi.controller.repository.claim. >>> StandardResourceClaimManager" >>> level="DEBUG" /> >>> Then that should allow you to see a DEBUG-level log message every time >>> that a Resource Claim is marked destructible and every time >>> that the Content Repository requests the collection of Destructible >>> Claims ("Drained 100 destructable claims" for instance) >>> >>> Any of the logs related to those statements should be very valuable in >>> determining what's going on. >>> >>> Thanks again for all of the detailed analysis. Hopefully we can get >>> this all squared away and taken care of quickly! 
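A toy model of the ordering described above, to make the failure mode explicit: nothing becomes destructible until the repository update has been forced to disk, so if that step fails (for example, with "No space left on device"), cleanup stops entirely. This is an illustration of the idea only, not NiFi's implementation; all names here are made up.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Claims that are no longer referenced are only moved to the destructible list
// after the journal recording that change has been fsync'ed. If checkpoint()
// throws, the pending list just keeps growing and no content is ever deleted.
public class CheckpointThenDestroy {
    private final List<Path> pendingUnreferenced = new ArrayList<>();
    private final List<Path> destructible = new ArrayList<>();

    public void contentNoLongerReferenced(Path contentFile) {
        pendingUnreferenced.add(contentFile);
    }

    public void checkpoint(Path journal) throws IOException {
        try (FileChannel ch = FileChannel.open(journal, StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
            ch.force(true); // flush the repository update to disk first
        }
        destructible.addAll(pendingUnreferenced); // only now is deletion safe
        pendingUnreferenced.clear();
    }

    public void destroyClaims() throws IOException {
        for (Path p : destructible) {
            Files.deleteIfExists(p);
        }
        destructible.clear();
    }

    public static void main(String[] args) throws IOException {
        CheckpointThenDestroy repo = new CheckpointThenDestroy();
        Path content = Files.createTempFile("claim", ".bin");
        repo.contentNoLongerReferenced(content);
        repo.checkpoint(Files.createTempFile("flowfile-repo", ".journal"));
        repo.destroyClaims();
        System.out.println("content file still exists: " + Files.exists(content));
    }
}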
>>> >>> -Mark >>> >>> >>> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com<mailto: >>> al...@cloudera.com><mailto:ala >>> n...@cloudera.com<mailto:n...@cloudera.com>><mailto: >>> al...@cloudera.com<mailto:al...@cloudera.com><mailto:al...@cloudera.com>>> >>> wrote: >>> >>> Here is what I have figured out so far. >>> >>> The cleanups are scheduled at https://github.com/apache/nifi >>> /blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-fra >>> mework/nifi-framework-core/src/main/java/org/apache/nifi/con >>> troller/repository/FileSystemRepository.java#L232 >>> >>> I'm not totally sure which one of those is the one that should be >>> cleaning things up. It's either ArchiveOrDestroyDestructableClaims or >>> DestroyExpiredArchiveClaims, both of which are in that class, and both of >>> which are scheduled with scheduleWithFixedDelay. Based on docs at >>> https://docs.oracle.com/javase/7/docs/api/java/util/concurre >>> nt/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(j >>> ava.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit) >>> if those methods fail once, they will stop running forever. Also if the >>> thread got stuck it wouldn't launch a new one. >>> >>> I then hoped I would go into the logs, see a failure, and use it to >>> figure out the issue. >>> >>> What I'm seeing instead is things like this, which comes from >>> BinDestructableClaims: >>> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] >>> o.a.n.c.repository.FileSystemRepository Deleted 159 files from >>> archive for Container default; oldest Archive Date is now Sat Dec 10 >>> 22:09:53 PST 2016; container cleanup took 34266 millis >>> that are somewhat frequent (as often as once per second, which is the >>> scheduling frequency). Then, eventually, they just stop. Unfortunately >>> there isn't an error message I can find that's killing these. >>> >>> At nifi startup, I see messages like this, which come from something >>> (not sure what yet) calling the cleanup() method on FileSystemRepository: >>> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository. >>> FileSystemRepository >>> Found unknown file /home/cops/edh-bundle-extracto >>> r/content_repository/0/1481467667784-2048 (1749645 bytes) in File >>> System Repository; removing file >>> I never see those after the initial cleanup that happens on restart. >>> >>> I attached a thread dump. I noticed at the top that there is a cleanup >>> thread parked. I took 10 more thread dumps after this and in every one of >>> them the cleanup thread was parked. That thread looks like it corresponds >>> to DestroyExpiredArchiveClaims, so I think it's incidental. I believe >>> that >>> if the cleanup task I need were running, it would be in one of the >>> FileSystemRepository Workers. However, in all of my thread dumps, these >>> were always all parked. >>> >>> Attached one of the thread dumps. >>> >>> Thanks, >>> Alan >>> >>> >>> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com >>> <mailto:marka...@hotmail.com>> wrote: >>> Alan, >>> >>> >>> It's possible that you've run into some sort of bug that is preventing >>> >>> it from cleaning up the Content Repository properly. While it's stuck >>> >>> in this state, could you capture a thread dump (bin/nifi.sh dump >>> thread-dump.txt)? >>> >>> That would help us determine if there is something going on that is >>> >>> preventing the cleanup from happening. 
>>> >>> >>> Thanks >>> >>> -Mark >>> >>> >>> ________________________________ >>> From: Alan Jackoway <al...@cloudera.com<mailto:al...@cloudera.com>> >>> Sent: Sunday, December 11, 2016 11:11 AM >>> To: dev@nifi.apache.org<mailto:dev@nifi.apache.org> >>> Subject: Re: Content Repository Cleanup >>> >>> This just filled up again even >>> with nifi.content.repository.archive.enabled=false. >>> >>> On the node that is still alive, our queued flowfiles are 91 / 16.47 >>> GB, >>> but the content repository directory is using 646 GB. >>> >>> Is there a property I can set to make it clean things up more >>> frequently? I >>> expected that once I turned archive enabled off, it would delete things >>> from the content repository as soon as the flow files weren't queued >>> anywhere. So far the only way I have found to reliably get nifi to >>> clear >>> out the content repository is to restart it. >>> >>> Our version string is the following, if that interests you: >>> 11/26/2016 04:39:37 PST >>> Tagged nifi-1.1.0-RC2 >>> From ${buildRevision} on branch ${buildBranch} >>> >>> Maybe we will go to the released 1.1 and see if that helps. Until then >>> I'll >>> be restarting a lot and digging into the code to figure out where this >>> cleanup is supposed to happen. Any pointers on code/configs for that >>> would >>> be appreciated. >>> >>> Thanks, >>> Alan >>> >>> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com >>> <mailto:jgres...@gmail.com>> wrote: >>> >>> No, in my scenario a server restart would not affect the content >>> repository >>> size. >>> >>> On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com>> wrote: >>> >>> If we were in the situation Joe G described, should we expect that >>> when >>> we >>> kill and restart nifi it would clean everything up? That behavior >>> has >>> been >>> consistent every time - when the disk hits 100%, we kill nifi, >>> delete >>> enough old content files to bring it back up, and before it bring >>> the UI >>> up >>> it deletes things to get within the archive policy again. That >>> sounds >>> less >>> like the files are stuck and more like it failed trying. >>> >>> For now I just turned off archiving, since we don't really need it >>> for >>> this use case. >>> >>> I attached a jstack from last night's failure, which looks pretty >>> boring >>> to me. >>> >>> On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com>> >>> wrote: >>> >>> The scenario Joe G describes is almost exactly what we are doing. >>> We >>> bring in large files and unpack them into many smaller ones. In >>> the most >>> recent iteration of this problem, I saw that we had many small >>> files >>> queued >>> up at the time trouble was happening. We will try your suggestion >>> to >>> see if >>> the situation improves. >>> >>> Thanks, >>> Alan >>> >>> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com >>> <mailto:jgres...@gmail.com>> >>> wrote: >>> >>> Not sure if your scenario is related, but one of the NiFi devs >>> recently >>> explained to me that the files in the content repository are >>> actually >>> appended together with other flow file content (please correct >>> me if >>> I'm >>> explaining it wrong). 
That means if you have many small flow >>> files in >>> your >>> current backlog, and several large flow files have recently left >>> the >>> flow, >>> the large ones could still be hanging around in the content >>> repository >>> as >>> long as the small ones are still there, if they're in the same >>> appended >>> files on disk. >>> >>> This scenario recently happened to us: we had a flow with ~20 >>> million >>> tiny >>> flow files queued up, and at the same time we were also >>> processing a >>> bunch >>> of 1GB files, which left the flow quickly. The content >>> repository was >>> much >>> larger than what was actually being reported in the flow stats, >>> and our >>> disks were almost full. On a hunch, I tried the following >>> strategy: >>> - MergeContent the tiny flow files using flow-file-v3 format (to >>> capture >>> all attributes) >>> - MergeContent 10,000 of the packaged flow files using tar >>> format for >>> easier storage on disk >>> - PutFile into a directory >>> - GetFile from the same directory, but using back pressure from >>> here on >>> out >>> (so that the flow simply wouldn't pull the same files from disk >>> until >>> it >>> was really ready for them) >>> - UnpackContent (untar them) >>> - UnpackContent (turn them back into flow files with the original >>> attributes) >>> - Then do the processing they were originally designed for >>> >>> This had the effect of very quickly reducing the size of my >>> content >>> repository to very nearly the actual size I saw reported in the >>> flow, >>> and >>> my disk usage dropped from ~95% to 50%, which is the configured >>> content >>> repository max usage percentage. I haven't had any problems >>> since. >>> >>> Hope this helps. >>> Joe >>> >>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com >>> <mailto:joe.w...@gmail.com>> wrote: >>> >>> Alan, >>> >>> That retention percentage only has to do with the archive of >>> data >>> which kicks in once a given chunk of content is no longer >>> reachable >>> by >>> active flowfiles in the flow. For it to grow to 100% >>> typically would >>> mean that you have data backlogged in the flow that account >>> for that >>> much space. If that is certainly not the case for you then we >>> need >>> to >>> dig deeper. If you could do screenshots or share log files >>> and stack >>> dumps around this time those would all be helpful. If the >>> screenshots >>> and such are too sensitive please just share as much as you >>> can. >>> >>> Thanks >>> Joe >>> >>> On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway < >>> al...@cloudera.com<mailto:al...@cloudera.com>> >>> wrote: >>> One other note on this, when it came back up there were tons >>> of >>> messages >>> like this: >>> >>> 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository. >>> FileSystemRepository >>> Found unknown file /path/to/content_repository/49 >>> 8/1481329796415-87538 >>> (1071114 bytes) in File System Repository; archiving file >>> >>> I haven't dug into what that means. >>> Alan >>> >>> On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway < >>> al...@cloudera.com<mailto:al...@cloudera.com>> >>> wrote: >>> >>> Hello, >>> >>> We have a node on which nifi content repository keeps >>> growing to >>> use >>> 100% >>> of the disk. It's a relatively high-volume process. It >>> chewed >>> through >>> more >>> than 100GB in the three hours between when we first saw it >>> hit >>> 100% >>> of >>> the >>> disk and when we just cleaned it up again. >>> >>> We are running nifi 1.1 for this. 
>>> Our nifi.properties looked like this:
>>>
>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>> nifi.content.claim.max.appendable.size=10 MB
>>> nifi.content.claim.max.flow.files=100
>>> nifi.content.repository.directory.default=./content_repository
>>> nifi.content.repository.archive.max.retention.period=12 hours
>>> nifi.content.repository.archive.max.usage.percentage=50%
>>> nifi.content.repository.archive.enabled=true
>>> nifi.content.repository.always.sync=false
>>>
>>> I just bumped retention period down to 2 hours, but should max usage percentage protect us from using 100% of the disk?
>>>
>>> Unfortunately we didn't get jstacks on either failure. If it hits 100% again I will make sure to get that.
>>>
>>> Thanks,
>>> Alan
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>>>
>>> <thread-dump.txt>
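To make the earlier point about appended content concrete (Joe Gresock's small-files scenario, and the nifi.content.claim.max.flow.files / max.appendable.size settings above), here is a toy model, not NiFi code, of many FlowFiles sharing one claim file: a single lingering small FlowFile keeps the whole claim on disk, including a large payload that already left the flow, until the claimant count reaches zero.

import java.util.HashMap;
import java.util.Map;

// Toy model of content-claim appending and claimant counting.
public class ClaimAppendingModel {
    static class Claim {
        final String id;
        int claimantCount;
        long bytes;
        Claim(String id) { this.id = id; }
    }

    private static final int MAX_FLOW_FILES_PER_CLAIM = 100; // stands in for nifi.content.claim.max.flow.files

    private final Map<String, Claim> claims = new HashMap<>();
    private Claim current;
    private int flowFilesInCurrent;

    // Assign a FlowFile's content to a claim, appending to the current one if it has room.
    public Claim write(long contentSize) {
        if (current == null || flowFilesInCurrent >= MAX_FLOW_FILES_PER_CLAIM) {
            current = new Claim("claim-" + claims.size());
            claims.put(current.id, current);
            flowFilesInCurrent = 0;
        }
        flowFilesInCurrent++;
        current.claimantCount++;     // "Incrementing claimant count ... to N"
        current.bytes += contentSize;
        return current;
    }

    // Called when a FlowFile referencing the claim leaves the flow.
    public void release(Claim claim) {
        claim.claimantCount--;       // "Decrementing claimant count ... to N"
        if (claim.claimantCount == 0) {
            claims.remove(claim.id); // only now can the whole file be archived or destroyed
        }
    }

    public long bytesOnDisk() {
        return claims.values().stream().mapToLong(c -> c.bytes).sum();
    }

    public static void main(String[] args) {
        ClaimAppendingModel repo = new ClaimAppendingModel();
        Claim shared = repo.write(1_000_000_000L);   // one large payload
        for (int i = 0; i < 50; i++) {
            repo.write(1_000L);                      // small payloads appended to the same claim
        }
        repo.release(shared);                        // the large FlowFile leaves the flow...
        System.out.println("bytes still on disk: " + repo.bytesOnDisk()); // ...but its bytes remain
    }
}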