I'm going to move future conversation to that jira, but I've tried a few things now and all of them seem to strand files in the content repository. My question, reworded, is:
How do you clean up new flow files that a processor created when the session rolls back / penalizes?

On Thu, Dec 15, 2016 at 11:24 AM, Alan Jackoway <al...@cloudera.com> wrote:

> Update: session.remove(newFiles) does not work. I filed
> https://issues.apache.org/jira/browse/NIFI-3205
>
> On Thu, Dec 15, 2016 at 11:05 AM, Alan Jackoway <al...@cloudera.com> wrote:
>
>> I am getting the successfully checkpointed message.
>>
>> I think I figured this out. Now we have to decide whether it's an issue in nifi or an issue in our code.
>>
>> This flow has a process that takes large zip files, unzips them, and does some processing of the files. I noticed that disk usage seems to go up fastest when there is a large file that is failing in the middle of one of these steps. I then suspected that something about the way we were creating new flow files out of the zips was the problem.
>>
>> I simulated what we were doing with the following processor in a new nifi 1.1. The processor takes an input file, copies it 5 times (to simulate unzip / process), then throws a runtime exception. I then wired a GenerateFlowFile of 100KB to it. I noticed the following characteristics:
>> * Each time it ran, the size of the content repository went up exactly 500KB.
>> * When I restarted the nifi, I got the messages about unknown files in the FileSystemRepository.
>>
>> So basically what this boils down to is: who is responsible for removing files from a session when a failure occurs? Should we be doing that (I will test next whether calling session.remove before the error fixes the problem), or should the session keep track of new flow files that it created? We assumed the session would do so, because the session yells at us if we fail to give a transfer relationship for one of the files.
>>
>> Thanks for all the help with this. I think we are closing in on the point where I have either a fix or a bug filed or both.
>>
>> Test processor I used:
>>
>> // Copyright 2016 (c) Cloudera
>> package com.cloudera.edh.nifi.processors.bundles;
>>
>> import com.google.common.collect.Lists;
>>
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.io.OutputStream;
>> import java.util.List;
>>
>> import org.apache.nifi.annotation.behavior.InputRequirement;
>> import org.apache.nifi.annotation.behavior.InputRequirement.Requirement;
>> import org.apache.nifi.flowfile.FlowFile;
>> import org.apache.nifi.processor.AbstractProcessor;
>> import org.apache.nifi.processor.ProcessContext;
>> import org.apache.nifi.processor.ProcessSession;
>> import org.apache.nifi.processor.exception.ProcessException;
>> import org.apache.nifi.processor.io.InputStreamCallback;
>> import org.apache.nifi.processor.io.OutputStreamCallback;
>> import org.apache.nifi.stream.io.StreamUtils;
>>
>> /**
>>  * Makes 5 copies of an incoming file, then fails and rolls back.
>>  */
>> @InputRequirement(value = Requirement.INPUT_REQUIRED)
>> public class CopyAndFail extends AbstractProcessor {
>>
>>     @Override
>>     public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
>>         FlowFile inputFile = session.get();
>>         if (inputFile == null) {
>>             context.yield();
>>             return;
>>         }
>>         final List<FlowFile> newFiles = Lists.newArrayList();
>>
>>         // Copy the file 5 times (simulates us opening a zip file and unpacking its contents)
>>         for (int i = 0; i < 5; i++) {
>>             session.read(inputFile, new InputStreamCallback() {
>>                 @Override
>>                 public void process(InputStream inputStream) throws IOException {
>>                     FlowFile ff = session.create(inputFile);
>>                     ff = session.write(ff, new OutputStreamCallback() {
>>                         @Override
>>                         public void process(final OutputStream out) throws IOException {
>>                             StreamUtils.copy(inputStream, out);
>>                         }
>>                     });
>>                     newFiles.add(ff);
>>                 }
>>             });
>>         }
>>
>>         // THIS IS WHERE I WILL PUT session.remove TO VERIFY THAT WORKS
>>
>>         // Simulate an error handling some file in the zip after unpacking the rest
>>         throw new RuntimeException();
>>     }
>> }
>>
>> On Wed, Dec 14, 2016 at 9:23 PM, Mark Payne <marka...@hotmail.com> wrote:
>>
>>> I'd be very curious to see if changing the limits addresses the issue. The OOME can certainly be an issue as well. Once that gets thrown anywhere in the JVM, it's hard to vouch for the stability of the JVM at all.
>>>
>>> Seeing the claimant count drop to 0, then back up to 1, 2, and down to 1, 0 again is pretty common. The fact that you didn't see it marked as destructible is interesting. Around that same time, are you seeing log messages indicating that the FlowFile repo is checkpointing? They would have the words "Successfully checkpointed FlowFile Repository". That should happen approximately every 2 minutes.
>>>
>>> On Dec 14, 2016, at 8:56 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>>
>>> I agree the limits sound low and will address that tomorrow.
>>>
>>> I'm not seeing FileNotFound or NoSuchFile.
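For reference, a minimal sketch of the remove-before-failure variant of the CopyAndFail test processor above. The class name and the rollback(true) call are hypothetical choices, and per the update at the top of this thread, session.remove(newFiles) did not actually reclaim the space; that is what NIFI-3205 tracks.

// Hypothetical variant of CopyAndFail: explicitly remove the FlowFiles this
// session created before failing. In practice this still stranded the content
// (NIFI-3205); it is shown here only to make the question above concrete.
package com.cloudera.edh.nifi.processors.bundles;

import com.google.common.collect.Lists;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.InputStreamCallback;
import org.apache.nifi.stream.io.StreamUtils;

public class CopyRemoveAndFail extends AbstractProcessor {

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile inputFile = session.get();
        if (inputFile == null) {
            context.yield();
            return;
        }
        final List<FlowFile> newFiles = Lists.newArrayList();
        try {
            for (int i = 0; i < 5; i++) {
                session.read(inputFile, new InputStreamCallback() {
                    @Override
                    public void process(InputStream in) throws IOException {
                        FlowFile ff = session.create(inputFile);
                        ff = session.write(ff, out -> StreamUtils.copy(in, out));
                        newFiles.add(ff);
                    }
                });
            }
            // Simulate an error partway through handling the zip
            throw new ProcessException("simulated failure");
        } catch (final ProcessException e) {
            // Drop the copies this session created, then roll back and penalize the input.
            // The expectation was that this would free the content claims; it did not.
            session.remove(newFiles);
            session.rollback(true);
        }
    }
}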
>>> >>> Here's an example file: >>> grep 1481763927251 logs/nifi-app.log >>> 2016-12-14 17:05:27,277 DEBUG [Timer-Driven Process Thread-36] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:05:27,357 DEBUG [Timer-Driven Process Thread-2] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 2 >>> 2016-12-14 17:05:27,684 DEBUG [Timer-Driven Process Thread-36] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:05:27,732 DEBUG [Timer-Driven Process Thread-2] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 0 >>> 2016-12-14 17:05:27,909 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:05:27,945 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Incrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 2 >>> 2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 1 >>> 2016-12-14 17:14:26,556 DEBUG [Timer-Driven Process Thread-14] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for >>> StandardResourceClaim[id=1481763927251-1, container=default, section=1] >>> to 0 >>> >>> This nifi-app.log covers a period when the nifi only handled two sets of >>> files for a total of maybe 10GB uncompressed. Content repository went >>> over >>> 100GB in that time. I checked a few content repository files, and they >>> all >>> had similar patterns - claims hit 0 twice - once around 17:05 and once >>> around 17:14, then nothing. I brought down the nifi around 17:30. >>> >>> During that time, we did have a processor hitting OutOfMemory while >>> unpacking a 1GB file. I'm adjusting the heap to try to make that succeed >>> in >>> case that was related. >>> >>> On Wed, Dec 14, 2016 at 8:32 PM, Mark Payne <marka...@hotmail.com >>> <mailto:marka...@hotmail.com>> wrote: >>> >>> OK, so these are generally the default values for most linux systems. >>> These are a little low, >>> though for what NiFi recommends and often needs. With these settings, you >>> can easily run >>> out of open file handles. When this happens, trying to access a file will >>> return a FileNotFoundException >>> even though the file exists and permissions all look good. As a result, >>> NiFi may be failing to >>> delete the data simply because it can't get an open file handle. >>> >>> The admin guide [1] explains the best practices for configuring these >>> settings. Generally, after updating >>> these settings, I think you have to logout of the machine and login again >>> for the changes to take effect. >>> Would recommend you update these settings and also search logs for >>> "FileNotFound" as well as >>> "NoSuchFile" and see if that hits anywhere. 
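A quick way to reproduce the symptom Mark describes: the sketch below (standalone Java, not NiFi code; the temp-file prefix is arbitrary) opens streams on a file that plainly exists, without closing them, until the process runs out of file handles, at which point the JDK raises FileNotFoundException even though the file is there and readable.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Exhausts the per-process open-file limit. On Linux the failure typically
// reads "java.io.FileNotFoundException: ... (Too many open files)".
public class ExhaustFileHandles {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fd-demo", ".tmp");
        f.deleteOnExit();
        List<FileInputStream> open = new ArrayList<>();
        try {
            while (true) {
                open.add(new FileInputStream(f)); // deliberately never closed
            }
        } catch (FileNotFoundException e) {
            System.out.println("Failed after " + open.size() + " open streams: " + e.getMessage());
        } finally {
            for (FileInputStream in : open) {
                in.close();
            }
        }
    }
}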
>>> >>> >>> [1] http://nifi.apache.org/docs/nifi-docs/html/administration- >>> guide.html#configuration-best-practices >>> >>> >>> On Dec 14, 2016, at 8:25 PM, Alan Jackoway <al...@cloudera.com<mailto: >>> al...@cloudera.com><mailto:ala >>> n...@cloudera.com<mailto:n...@cloudera.com>>> wrote: >>> >>> We haven't let the disk hit 100% in a while, but it's been crossing 90%. >>> We >>> haven't seen the "Unable to checkpoint" message in the last 24 hours. >>> >>> $ ulimit -Hn >>> 4096 >>> $ ulimit -Sn >>> 1024 >>> >>> I will work on tracking a specific file next. >>> >>> >>> On Wed, Dec 14, 2016 at 8:17 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com><mailto: >>> al...@cloudera.com<mailto:al...@cloudera.com>>> wrote: >>> >>> At first I thought that the drained messages always said 0, but that's >>> not >>> right. What should the total number of claims drained be? The number of >>> flowfiles that made it through the system? If so, I think our number is >>> low: >>> >>> $ grep "StandardResourceClaimManager Drained" nifi-app_2016-12-14* | >>> grep >>> -v "Drained 0" | awk '{sum += $9} END {print sum}' >>> 25296 >>> >>> I'm not sure how to get the count of flowfiles that moved through, but I >>> suspect that's low by an order of magnitude. That instance of nifi has >>> handled 150k files in the last 6 hours, most of which went through a >>> number >>> of processors and transformations. >>> >>> Should the number of drained claims correspond to the number of flow >>> files >>> that moved through the system? >>> Alan >>> >>> On Wed, Dec 14, 2016 at 6:59 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com><mailto: >>> al...@cloudera.com<mailto:al...@cloudera.com>>> wrote: >>> >>> Some updates: >>> * We fixed the issue with missing transfer relationships, and this did >>> not go away. >>> * We saw this a few minutes ago when the queue was at 0. >>> >>> What should I be looking for in the logs to figure out the issue? >>> >>> Thanks, >>> Alan >>> >>> On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com> >>> <mailto:al...@cloudera.com>> >>> wrote: >>> >>> In case this is interesting, I think this started getting bad when we >>> started hitting an error where some of our files were not given a >>> transfer >>> relationship. Maybe some combination of not giving flow files a >>> relationship and the subsequent penalization is causing the problem. >>> >>> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com> >>> <mailto:al...@cloudera.com>> >>> wrote: >>> >>> Everything is at the default locations for these nifis. >>> >>> On one of the two machines, I did find log messages like you suggested: >>> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] >>> o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile >>> Repository due to java.io.FileNotFoundException: >>> ./flowfile_repository/partition-14/3169.journal (No space left on >>> device) >>> >>> I added the logger, which apparently takes effect right away. What am I >>> looking for in this logs? 
I see a lot of stuff like: >>> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555893660-3174, container=default, >>> section=102] to 0 >>> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555922818-3275, container=default, >>> section=203] to 191 >>> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555880393-3151, container=default, >>> section=79] to 142 >>> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555872053-3146, container=default, >>> section=74] to 441 >>> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555893954-3178, container=default, >>> section=106] to 2 >>> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] >>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count >>> for StandardResourceClaim[id=1481555893696-3175, container=default, >>> section=103] to 1 >>> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] >>> o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims >>> to [] >>> >>> What's puzzling to me is that both of these machines have > 100GB of >>> free space, and I have never seen the queued size go above 20GB. It seems >>> to me like it gets into a state where nothing is deleted long before it >>> runs out of disk space. >>> >>> Thanks, >>> Alan >>> >>> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com >>> <mailto:marka...@hotmail.com><mailto:m >>> arka...@hotmail.com<mailto:arka...@hotmail.com>>> >>> wrote: >>> >>> Alan, >>> >>> Thanks for the thread-dump and the in-depth analysis! >>> >>> So in terms of the two tasks there, here's a quick explanation of what >>> each does: >>> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which >>> maps to a file on disk) is no longer referenced >>> by any FlowFile, it can be either archived or destroyed (depending on >>> whether the property in nifi.properties has archiving >>> enabled). >>> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource >>> Claims that are archived have to eventually >>> age off. This task is responsible for ensuring that this happens. >>> >>> As you mentioned, in the Executor, if the Runnable fails it will stop >>> running forever, and if the thread gets stuck, another will >>> not be launched. Neither of these appears to be the case. I say this >>> because both of those Runnables are wrapped entirely >>> within a try { ... } catch (Throwable t) {...}. So the method will >>> never end Exceptionally. Also, the thread dump shows all of the >>> threads created by that Thread Pool (those whose names begin with >>> "FileSystemRepository Workers Thread-") in WAITING >>> or TIMED_WAITING state. This means that they are sitting in the >>> Executor waiting to be scheduled to do something else, >>> so they aren't stuck in any kind of infinite loop or anything like >>> that. 
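The wrapping Mark mentions is what keeps these tasks alive: ScheduledThreadPoolExecutor suppresses all further executions of a scheduleWithFixedDelay task once a run throws. A minimal demo of the difference (just the shape of the pattern, not NiFi's actual code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Two tasks on a fixed delay. The unguarded one throws on its first run and is
// never rescheduled; the guarded one catches Throwable inside run() and keeps firing.
public class FixedDelayGuardDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(2);
        AtomicInteger unguardedRuns = new AtomicInteger();
        AtomicInteger guardedRuns = new AtomicInteger();

        exec.scheduleWithFixedDelay(() -> {
            unguardedRuns.incrementAndGet();
            throw new RuntimeException("boom"); // suppresses all later executions of this task
        }, 0, 100, TimeUnit.MILLISECONDS);

        exec.scheduleWithFixedDelay(() -> {
            try {
                guardedRuns.incrementAndGet();
                throw new RuntimeException("boom");
            } catch (final Throwable t) {
                // swallowed, so the fixed-delay schedule survives
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);
        System.out.println("unguarded ran " + unguardedRuns + " time(s), guarded ran " + guardedRuns);
        exec.shutdownNow();
    }
}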
>>> >>> Now, with all of that being said, I have a theory as to what could >>> perhaps be happening :) >>> >>> From the configuration that you listed below, it shows that the >>> content repository is located at ./content_repository, which is >>> the default. Is the FlowFile Repository also located at the default >>> location of ./flowfile_repository? The reason that I ask is this: >>> >>> When I said above that a Resource Claim is marked destructible when no >>> more FlowFiles reference it, that was a bit of a >>> simplification. A more detailed explanation is this: when the FlowFile >>> Repository is checkpointed (this happens every 2 minutes >>> by default), its Write-Ahead Log is "rolled over" (or "checkpointed" >>> or "compacted" or however you like to refer to it). When this >>> happens, we do an fsync() to ensure that the data is stored safely on >>> disk. Only then do we actually mark a claim as destructible. >>> This is done in order to ensure that if there is a power outage and a >>> FlowFile Repository update wasn't completely flushed to disk, >>> that we can recover. For instance, if the content of a FlowFile >>> changes from Resource Claim A to Resource Claim B and as a result >>> we delete Resource Claim A and then lose power, it's possible that the >>> FlowFile Repository didn't flush that update to disk; as a result, >>> on restart, we may still have that FlowFile pointing to Resource Claim >>> A which is now deleted, so we would end up having data loss. >>> This method of only deleting Resource Claims after the FlowFile >>> Repository has been fsync'ed means that we know on restart that >>> Resource Claim A won't still be referenced. >>> >>> So that was probably a very wordy, verbose description of what happens >>> but I'm trying to make sure that I explain things adequately. >>> So with that background... if you are storing your FlowFile Repository >>> on the same volume as your Content Repository, the following >>> could happen: >>> >>> At some point in time, enough data is queued up in your flow for you >>> to run out of disk space. As a result, the FlowFile Repository is >>> unable to be compacted. Since this is not happening, it will not mark >>> any of the Resource Claims as destructible. This would mean that >>> the Content Repository does not get cleaned up. So now you've got a >>> full Content Repository and it's unable to clean up after itself, because >>> no Resource Claims are getting marked as destructible. >>> >>> So to prove or disprove this theory, there are a few things that you >>> can look at: >>> >>> Do you see the following anywhere in your logs: Unable to checkpoint >>> FlowFile Repository >>> >>> If you add the following to your conf/logback.xml: >>> <logger name="org.apache.nifi.controller.repository.claim. >>> StandardResourceClaimManager" >>> level="DEBUG" /> >>> Then that should allow you to see a DEBUG-level log message every time >>> that a Resource Claim is marked destructible and every time >>> that the Content Repository requests the collection of Destructible >>> Claims ("Drained 100 destructable claims" for instance) >>> >>> Any of the logs related to those statements should be very valuable in >>> determining what's going on. >>> >>> Thanks again for all of the detailed analysis. Hopefully we can get >>> this all squared away and taken care of quickly! 
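A toy model of the ordering described above, to make the failure mode explicit: nothing becomes destructible until the repository update has been forced to disk, so if that step fails (for example, with "No space left on device"), cleanup stops entirely. This is an illustration of the idea only, not NiFi's implementation; all names here are made up.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Claims that are no longer referenced are only moved to the destructible list
// after the journal recording that change has been fsync'ed. If checkpoint()
// throws, the pending list just keeps growing and no content is ever deleted.
public class CheckpointThenDestroy {
    private final List<Path> pendingUnreferenced = new ArrayList<>();
    private final List<Path> destructible = new ArrayList<>();

    public void contentNoLongerReferenced(Path contentFile) {
        pendingUnreferenced.add(contentFile);
    }

    public void checkpoint(Path journal) throws IOException {
        try (FileChannel ch = FileChannel.open(journal, StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
            ch.force(true); // flush the repository update to disk first
        }
        destructible.addAll(pendingUnreferenced); // only now is deletion safe
        pendingUnreferenced.clear();
    }

    public void destroyClaims() throws IOException {
        for (Path p : destructible) {
            Files.deleteIfExists(p);
        }
        destructible.clear();
    }

    public static void main(String[] args) throws IOException {
        CheckpointThenDestroy repo = new CheckpointThenDestroy();
        Path content = Files.createTempFile("claim", ".bin");
        repo.contentNoLongerReferenced(content);
        repo.checkpoint(Files.createTempFile("flowfile-repo", ".journal"));
        repo.destroyClaims();
        System.out.println("content file still exists: " + Files.exists(content));
    }
}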
>>> >>> -Mark >>> >>> >>> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com<mailto: >>> al...@cloudera.com><mailto:ala >>> n...@cloudera.com<mailto:n...@cloudera.com>><mailto: >>> al...@cloudera.com<mailto:al...@cloudera.com><mailto:al...@cloudera.com>>> >>> wrote: >>> >>> Here is what I have figured out so far. >>> >>> The cleanups are scheduled at https://github.com/apache/nifi >>> /blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-fra >>> mework/nifi-framework-core/src/main/java/org/apache/nifi/con >>> troller/repository/FileSystemRepository.java#L232 >>> >>> I'm not totally sure which one of those is the one that should be >>> cleaning things up. It's either ArchiveOrDestroyDestructableClaims or >>> DestroyExpiredArchiveClaims, both of which are in that class, and both of >>> which are scheduled with scheduleWithFixedDelay. Based on docs at >>> https://docs.oracle.com/javase/7/docs/api/java/util/concurre >>> nt/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(j >>> ava.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit) >>> if those methods fail once, they will stop running forever. Also if the >>> thread got stuck it wouldn't launch a new one. >>> >>> I then hoped I would go into the logs, see a failure, and use it to >>> figure out the issue. >>> >>> What I'm seeing instead is things like this, which comes from >>> BinDestructableClaims: >>> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] >>> o.a.n.c.repository.FileSystemRepository Deleted 159 files from >>> archive for Container default; oldest Archive Date is now Sat Dec 10 >>> 22:09:53 PST 2016; container cleanup took 34266 millis >>> that are somewhat frequent (as often as once per second, which is the >>> scheduling frequency). Then, eventually, they just stop. Unfortunately >>> there isn't an error message I can find that's killing these. >>> >>> At nifi startup, I see messages like this, which come from something >>> (not sure what yet) calling the cleanup() method on FileSystemRepository: >>> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository. >>> FileSystemRepository >>> Found unknown file /home/cops/edh-bundle-extracto >>> r/content_repository/0/1481467667784-2048 (1749645 bytes) in File >>> System Repository; removing file >>> I never see those after the initial cleanup that happens on restart. >>> >>> I attached a thread dump. I noticed at the top that there is a cleanup >>> thread parked. I took 10 more thread dumps after this and in every one of >>> them the cleanup thread was parked. That thread looks like it corresponds >>> to DestroyExpiredArchiveClaims, so I think it's incidental. I believe >>> that >>> if the cleanup task I need were running, it would be in one of the >>> FileSystemRepository Workers. However, in all of my thread dumps, these >>> were always all parked. >>> >>> Attached one of the thread dumps. >>> >>> Thanks, >>> Alan >>> >>> >>> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com >>> <mailto:marka...@hotmail.com>> wrote: >>> Alan, >>> >>> >>> It's possible that you've run into some sort of bug that is preventing >>> >>> it from cleaning up the Content Repository properly. While it's stuck >>> >>> in this state, could you capture a thread dump (bin/nifi.sh dump >>> thread-dump.txt)? >>> >>> That would help us determine if there is something going on that is >>> >>> preventing the cleanup from happening. 
>>> >>> >>> Thanks >>> >>> -Mark >>> >>> >>> ________________________________ >>> From: Alan Jackoway <al...@cloudera.com<mailto:al...@cloudera.com>> >>> Sent: Sunday, December 11, 2016 11:11 AM >>> To: dev@nifi.apache.org<mailto:dev@nifi.apache.org> >>> Subject: Re: Content Repository Cleanup >>> >>> This just filled up again even >>> with nifi.content.repository.archive.enabled=false. >>> >>> On the node that is still alive, our queued flowfiles are 91 / 16.47 >>> GB, >>> but the content repository directory is using 646 GB. >>> >>> Is there a property I can set to make it clean things up more >>> frequently? I >>> expected that once I turned archive enabled off, it would delete things >>> from the content repository as soon as the flow files weren't queued >>> anywhere. So far the only way I have found to reliably get nifi to >>> clear >>> out the content repository is to restart it. >>> >>> Our version string is the following, if that interests you: >>> 11/26/2016 04:39:37 PST >>> Tagged nifi-1.1.0-RC2 >>> From ${buildRevision} on branch ${buildBranch} >>> >>> Maybe we will go to the released 1.1 and see if that helps. Until then >>> I'll >>> be restarting a lot and digging into the code to figure out where this >>> cleanup is supposed to happen. Any pointers on code/configs for that >>> would >>> be appreciated. >>> >>> Thanks, >>> Alan >>> >>> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com >>> <mailto:jgres...@gmail.com>> wrote: >>> >>> No, in my scenario a server restart would not affect the content >>> repository >>> size. >>> >>> On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com>> wrote: >>> >>> If we were in the situation Joe G described, should we expect that >>> when >>> we >>> kill and restart nifi it would clean everything up? That behavior >>> has >>> been >>> consistent every time - when the disk hits 100%, we kill nifi, >>> delete >>> enough old content files to bring it back up, and before it bring >>> the UI >>> up >>> it deletes things to get within the archive policy again. That >>> sounds >>> less >>> like the files are stuck and more like it failed trying. >>> >>> For now I just turned off archiving, since we don't really need it >>> for >>> this use case. >>> >>> I attached a jstack from last night's failure, which looks pretty >>> boring >>> to me. >>> >>> On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com >>> <mailto:al...@cloudera.com>> >>> wrote: >>> >>> The scenario Joe G describes is almost exactly what we are doing. >>> We >>> bring in large files and unpack them into many smaller ones. In >>> the most >>> recent iteration of this problem, I saw that we had many small >>> files >>> queued >>> up at the time trouble was happening. We will try your suggestion >>> to >>> see if >>> the situation improves. >>> >>> Thanks, >>> Alan >>> >>> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com >>> <mailto:jgres...@gmail.com>> >>> wrote: >>> >>> Not sure if your scenario is related, but one of the NiFi devs >>> recently >>> explained to me that the files in the content repository are >>> actually >>> appended together with other flow file content (please correct >>> me if >>> I'm >>> explaining it wrong). 
That means if you have many small flow >>> files in >>> your >>> current backlog, and several large flow files have recently left >>> the >>> flow, >>> the large ones could still be hanging around in the content >>> repository >>> as >>> long as the small ones are still there, if they're in the same >>> appended >>> files on disk. >>> >>> This scenario recently happened to us: we had a flow with ~20 >>> million >>> tiny >>> flow files queued up, and at the same time we were also >>> processing a >>> bunch >>> of 1GB files, which left the flow quickly. The content >>> repository was >>> much >>> larger than what was actually being reported in the flow stats, >>> and our >>> disks were almost full. On a hunch, I tried the following >>> strategy: >>> - MergeContent the tiny flow files using flow-file-v3 format (to >>> capture >>> all attributes) >>> - MergeContent 10,000 of the packaged flow files using tar >>> format for >>> easier storage on disk >>> - PutFile into a directory >>> - GetFile from the same directory, but using back pressure from >>> here on >>> out >>> (so that the flow simply wouldn't pull the same files from disk >>> until >>> it >>> was really ready for them) >>> - UnpackContent (untar them) >>> - UnpackContent (turn them back into flow files with the original >>> attributes) >>> - Then do the processing they were originally designed for >>> >>> This had the effect of very quickly reducing the size of my >>> content >>> repository to very nearly the actual size I saw reported in the >>> flow, >>> and >>> my disk usage dropped from ~95% to 50%, which is the configured >>> content >>> repository max usage percentage. I haven't had any problems >>> since. >>> >>> Hope this helps. >>> Joe >>> >>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com >>> <mailto:joe.w...@gmail.com>> wrote: >>> >>> Alan, >>> >>> That retention percentage only has to do with the archive of >>> data >>> which kicks in once a given chunk of content is no longer >>> reachable >>> by >>> active flowfiles in the flow. For it to grow to 100% >>> typically would >>> mean that you have data backlogged in the flow that account >>> for that >>> much space. If that is certainly not the case for you then we >>> need >>> to >>> dig deeper. If you could do screenshots or share log files >>> and stack >>> dumps around this time those would all be helpful. If the >>> screenshots >>> and such are too sensitive please just share as much as you >>> can. >>> >>> Thanks >>> Joe >>> >>> On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway < >>> al...@cloudera.com<mailto:al...@cloudera.com>> >>> wrote: >>> One other note on this, when it came back up there were tons >>> of >>> messages >>> like this: >>> >>> 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository. >>> FileSystemRepository >>> Found unknown file /path/to/content_repository/49 >>> 8/1481329796415-87538 >>> (1071114 bytes) in File System Repository; archiving file >>> >>> I haven't dug into what that means. >>> Alan >>> >>> On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway < >>> al...@cloudera.com<mailto:al...@cloudera.com>> >>> wrote: >>> >>> Hello, >>> >>> We have a node on which nifi content repository keeps >>> growing to >>> use >>> 100% >>> of the disk. It's a relatively high-volume process. It >>> chewed >>> through >>> more >>> than 100GB in the three hours between when we first saw it >>> hit >>> 100% >>> of >>> the >>> disk and when we just cleaned it up again. >>> >>> We are running nifi 1.1 for this. 
>>> Our nifi.properties looked like this:
>>>
>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>> nifi.content.claim.max.appendable.size=10 MB
>>> nifi.content.claim.max.flow.files=100
>>> nifi.content.repository.directory.default=./content_repository
>>> nifi.content.repository.archive.max.retention.period=12 hours
>>> nifi.content.repository.archive.max.usage.percentage=50%
>>> nifi.content.repository.archive.enabled=true
>>> nifi.content.repository.always.sync=false
>>>
>>> I just bumped retention period down to 2 hours, but should max usage percentage protect us from using 100% of the disk?
>>>
>>> Unfortunately we didn't get jstacks on either failure. If it hits 100% again I will make sure to get that.
>>>
>>> Thanks,
>>> Alan
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>>>
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*
>>>
>>> <thread-dump.txt>
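To make the earlier point about appended content concrete (Joe Gresock's small-files scenario, and the nifi.content.claim.max.flow.files / max.appendable.size settings above), here is a toy model, not NiFi code, of many FlowFiles sharing one claim file: a single lingering small FlowFile keeps the whole claim on disk, including a large payload that already left the flow, until the claimant count reaches zero.

import java.util.HashMap;
import java.util.Map;

// Toy model of content-claim appending and claimant counting.
public class ClaimAppendingModel {
    static class Claim {
        final String id;
        int claimantCount;
        long bytes;
        Claim(String id) { this.id = id; }
    }

    private static final int MAX_FLOW_FILES_PER_CLAIM = 100; // stands in for nifi.content.claim.max.flow.files

    private final Map<String, Claim> claims = new HashMap<>();
    private Claim current;
    private int flowFilesInCurrent;

    // Assign a FlowFile's content to a claim, appending to the current one if it has room.
    public Claim write(long contentSize) {
        if (current == null || flowFilesInCurrent >= MAX_FLOW_FILES_PER_CLAIM) {
            current = new Claim("claim-" + claims.size());
            claims.put(current.id, current);
            flowFilesInCurrent = 0;
        }
        flowFilesInCurrent++;
        current.claimantCount++;     // "Incrementing claimant count ... to N"
        current.bytes += contentSize;
        return current;
    }

    // Called when a FlowFile referencing the claim leaves the flow.
    public void release(Claim claim) {
        claim.claimantCount--;       // "Decrementing claimant count ... to N"
        if (claim.claimantCount == 0) {
            claims.remove(claim.id); // only now can the whole file be archived or destroyed
        }
    }

    public long bytesOnDisk() {
        return claims.values().stream().mapToLong(c -> c.bytes).sum();
    }

    public static void main(String[] args) {
        ClaimAppendingModel repo = new ClaimAppendingModel();
        Claim shared = repo.write(1_000_000_000L);   // one large payload
        for (int i = 0; i < 50; i++) {
            repo.write(1_000L);                      // small payloads appended to the same claim
        }
        repo.release(shared);                        // the large FlowFile leaves the flow...
        System.out.println("bytes still on disk: " + repo.bytesOnDisk()); // ...but its bytes remain
    }
}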