Some updates:
* We fixed the issue with missing transfer relationships, but the problem did not go away.
* We saw this again a few minutes ago while the queue was at 0.
What should I be looking for in the logs to figure out the issue?

Thanks,
Alan

On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <[email protected]> wrote:

> In case this is interesting: I think this started getting bad when we started hitting an error where some of our files were not given a transfer relationship. Maybe some combination of not giving flow files a relationship and the subsequent penalization is causing the problem.
>
> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <[email protected]> wrote:
>
>> Everything is at the default locations for these nifis.
>>
>> On one of the two machines, I did find log messages like you suggested:
>> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile Repository due to java.io.FileNotFoundException: ./flowfile_repository/partition-14/3169.journal (No space left on device)
>>
>> I added the logger, which apparently takes effect right away. What am I looking for in these logs? I see a lot of entries like:
>> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893660-3174, container=default, section=102] to 0
>> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555922818-3275, container=default, section=203] to 191
>> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555880393-3151, container=default, section=79] to 142
>> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555872053-3146, container=default, section=74] to 441
>> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893954-3178, container=default, section=106] to 2
>> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24] o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for StandardResourceClaim[id=1481555893696-3175, container=default, section=103] to 1
>> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1] o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims to []
>>
>> What's puzzling to me is that both of these machines have > 100GB of free space, and I have never seen the queued size go above 20GB. It seems to me like it gets into a state where nothing is deleted long before it runs out of disk space.
>>
>> Thanks,
>> Alan
>>
>> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <[email protected]> wrote:
>>
>>> Alan,
>>>
>>> Thanks for the thread dump and the in-depth analysis!
>>>
>>> So, in terms of the two tasks there, here's a quick explanation of what each does:
>>> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps to a file on disk) is no longer referenced by any FlowFile, it can be either archived or destroyed (depending on whether the property in nifi.properties has archiving enabled).
>>> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource Claims that are archived have to eventually age off. This task is responsible for ensuring that this happens.
>>>
>>> As you mentioned, if a Runnable given to the Executor fails, it will stop running forever, and if the thread gets stuck, another will not be launched. Neither of these appears to be the case. I say this because both of those Runnables are wrapped entirely in a try { ... } catch (Throwable t) { ... }, so the run method can never end exceptionally. Also, the thread dump shows all of the threads created by that thread pool (those whose names begin with "FileSystemRepository Workers Thread-") in WAITING or TIMED_WAITING state. This means they are sitting in the Executor waiting to be scheduled to do something else, so they aren't stuck in any kind of infinite loop.
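
To convince myself of that scheduleWithFixedDelay behavior, here is a minimal, self-contained sketch (my own demo code, nothing from NiFi) showing that a task which throws an uncaught exception is never rescheduled, while a task that wraps its body in try/catch (Throwable) keeps running:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

        // Unguarded task: throws on its third run. Per the
        // ScheduledThreadPoolExecutor javadoc, the uncaught exception
        // suppresses all subsequent executions of this task.
        final int[] runs = {0};
        executor.scheduleWithFixedDelay(() -> {
            runs[0]++;
            System.out.println("unguarded run " + runs[0]);
            if (runs[0] == 3) {
                throw new RuntimeException("boom");
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        // Guarded task: wraps its body in try/catch (Throwable), as the NiFi
        // cleanup Runnables do, so one failed run does not cancel future runs.
        executor.scheduleWithFixedDelay(() -> {
            try {
                throw new RuntimeException("boom");
            } catch (Throwable t) {
                System.out.println("guarded run survived: " + t.getMessage());
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);
        executor.shutdownNow();
    }
}

Run it and the unguarded task stops after its third execution, while the guarded one keeps printing until shutdown.
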
>>> Now, with all of that being said, I have a theory as to what could perhaps be happening :)
>>>
>>> The configuration that you listed below shows that the content repository is located at ./content_repository, which is the default. Is the FlowFile Repository also located at the default location of ./flowfile_repository? The reason that I ask is this:
>>>
>>> When I said above that a Resource Claim is marked destructible when no more FlowFiles reference it, that was a bit of a simplification. A more detailed explanation is this: when the FlowFile Repository is checkpointed (this happens every 2 minutes by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or "compacted" or however you like to refer to it). When this happens, we do an fsync() to ensure that the data is stored safely on disk. Only then do we actually mark a claim as destructible. This is done to ensure that we can recover if there is a power outage and a FlowFile Repository update wasn't completely flushed to disk. For instance, if the content of a FlowFile changes from Resource Claim A to Resource Claim B, and as a result we delete Resource Claim A and then lose power, it's possible that the FlowFile Repository didn't flush that update to disk; on restart, we may still have that FlowFile pointing to Resource Claim A, which is now deleted, so we would end up with data loss. Deleting Resource Claims only after the FlowFile Repository has been fsync'ed means that we know on restart that Resource Claim A won't still be referenced.
>>>
>>> That was probably a very wordy, verbose description of what happens, but I'm trying to make sure that I explain things adequately. So with that background... if you are storing your FlowFile Repository on the same volume as your Content Repository, the following could happen:
>>>
>>> At some point in time, enough data is queued up in your flow for you to run out of disk space. As a result, the FlowFile Repository cannot be compacted. Since that is not happening, it will not mark any of the Resource Claims as destructible, which means that the Content Repository does not get cleaned up. So now you've got a full Content Repository that is unable to clean up after itself, because no Resource Claims are getting marked as destructible.
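
To make sure I follow the ordering Mark lays out, here is a toy model I sketched (hypothetical names, not NiFi's actual implementation): a claimant count dropping to zero only makes a claim a candidate, and nothing becomes destructible until a successful checkpoint-plus-fsync hands the candidates over. A failing checkpoint would therefore leave the content repository growing even while claimant counts keep hitting zero:

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the resource-claim lifecycle described above; illustrative only.
public class ClaimLifecycleModel {
    private final Map<String, AtomicInteger> claimantCounts = new ConcurrentHashMap<>();
    private final Queue<String> pendingDestruction = new ConcurrentLinkedQueue<>();
    private final Queue<String> destructible = new ConcurrentLinkedQueue<>();

    public void incrementClaimants(String claimId) {
        claimantCounts.computeIfAbsent(claimId, id -> new AtomicInteger()).incrementAndGet();
    }

    // Mirrors the "Decrementing claimant count ... to 0" DEBUG messages: a zero
    // count makes the claim a candidate, but it is not yet destructible.
    public void decrementClaimants(String claimId) {
        AtomicInteger count = claimantCounts.get(claimId);
        if (count != null && count.decrementAndGet() == 0) {
            pendingDestruction.add(claimId);
        }
    }

    // Runs when the FlowFile Repository checkpoints (every 2 minutes by default).
    // Only after the write-ahead log is rolled over and fsync'ed do candidates
    // become destructible, so if the checkpoint fails ("No space left on
    // device"), nothing is ever handed to the content repository for deletion.
    public void onFlowFileRepositoryCheckpoint(boolean checkpointSucceeded) {
        if (!checkpointSucceeded) {
            return; // candidates stay pending; the content repository keeps growing
        }
        String claimId;
        while ((claimId = pendingDestruction.poll()) != null) {
            destructible.add(claimId);
        }
    }

    // Mirrors the "Drained N destructable claims" DEBUG message.
    public int drainDestructibleClaims() {
        int drained = 0;
        while (destructible.poll() != null) {
            drained++;
        }
        return drained;
    }

    public static void main(String[] args) {
        ClaimLifecycleModel model = new ClaimLifecycleModel();
        model.incrementClaimants("1481555893660-3174");
        model.decrementClaimants("1481555893660-3174"); // count hits 0 -> candidate
        model.onFlowFileRepositoryCheckpoint(false);    // failed checkpoint
        System.out.println("after failed checkpoint: drained " + model.drainDestructibleClaims());
        model.onFlowFileRepositoryCheckpoint(true);     // successful checkpoint + fsync
        System.out.println("after successful checkpoint: drained " + model.drainDestructibleClaims());
    }
}

In this model, the repeated "Drained 0 destructable claims" lines we see are exactly what a stuck checkpoint would produce.
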
>>> So to prove or disprove this theory, there are a few things that you can look at:
>>>
>>> Do you see the following anywhere in your logs: "Unable to checkpoint FlowFile Repository"?
>>>
>>> If you add the following to your conf/logback.xml:
>>> <logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />
>>> then you should see a DEBUG-level log message every time a Resource Claim is marked destructible, and every time the Content Repository requests the collection of destructible claims ("Drained 100 destructable claims", for instance).
>>>
>>> Any of the logs related to those statements should be very valuable in determining what's going on.
>>>
>>> Thanks again for all of the detailed analysis. Hopefully we can get this all squared away and taken care of quickly!
>>>
>>> -Mark
>>>
>>> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <[email protected]> wrote:
>>>
>>> Here is what I have figured out so far.
>>>
>>> The cleanups are scheduled at https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232
>>>
>>> I'm not totally sure which of those is the one that should be cleaning things up. It's either ArchiveOrDestroyDestructableClaims or DestroyExpiredArchiveClaims, both of which are in that class, and both of which are scheduled with scheduleWithFixedDelay. Based on the docs at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit), if those tasks fail once, they will stop running forever. Also, if the thread got stuck, a new one would not be launched.
>>>
>>> I then hoped to go into the logs, see a failure, and use it to figure out the issue.
>>>
>>> What I'm seeing instead is messages like this, which come from BinDestructableClaims:
>>> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST 2016; container cleanup took 34266 millis
>>> They are somewhat frequent (as often as once per second, which is the scheduling frequency). Then, eventually, they just stop. Unfortunately, I can't find an error message that kills them.
>>>
>>> At nifi startup, I see messages like this, which come from something (I'm not sure what yet) calling the cleanup() method on FileSystemRepository:
>>> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048 (1749645 bytes) in File System Repository; removing file
>>> I never see those after the initial cleanup that happens on restart.
>>>
>>> I attached a thread dump. I noticed at the top that there is a cleanup thread parked. I took 10 more thread dumps after this, and in every one of them the cleanup thread was parked. That thread looks like it corresponds to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that if the cleanup task I need were running, it would be in one of the FileSystemRepository Workers. However, in all of my thread dumps, those were always all parked.
>>>
>>> Attached one of the thread dumps.
>>>
>>> Thanks,
>>> Alan
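
For anyone repeating this analysis, here is a small helper I sketched to pull thread states out of a dump file (my own code, not a NiFi utility; it assumes the usual HotSpot dump layout of a quoted thread-name header line followed by a java.lang.Thread.State line, which is what bin/nifi.sh dump produced for me):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Scan a thread dump and print the state of every thread whose name starts
// with the given prefix, e.g. "FileSystemRepository Workers".
public class ThreadDumpGrep {
    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "thread-dump.txt";
        String prefix = args.length > 1 ? args[1] : "FileSystemRepository Workers";

        List<String> lines = Files.readAllLines(Paths.get(file));
        String currentThread = null;
        for (String line : lines) {
            if (line.startsWith("\"")) {
                // Header line, e.g.: "FileSystemRepository Workers Thread-1" daemon ...
                int end = line.indexOf('"', 1);
                String name = end > 0 ? line.substring(1, end) : null;
                currentThread = (name != null && name.startsWith(prefix)) ? name : null;
            } else if (currentThread != null && line.contains("java.lang.Thread.State:")) {
                String state = line.substring(line.indexOf(':') + 1).trim();
                System.out.println(currentThread + " -> " + state);
                currentThread = null;
            }
        }
    }
}

Across all of my dumps, every matching thread came back parked (WAITING or TIMED_WAITING).
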
>>> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <[email protected]> wrote:
>>>
>>> Alan,
>>>
>>> It's possible that you've run into some sort of bug that is preventing it from cleaning up the Content Repository properly. While it's stuck in this state, could you capture a thread dump (bin/nifi.sh dump thread-dump.txt)? That would help us determine whether something is preventing the cleanup from happening.
>>>
>>> Thanks
>>> -Mark
>>>
>>> ________________________________
>>> From: Alan Jackoway <[email protected]>
>>> Sent: Sunday, December 11, 2016 11:11 AM
>>> To: [email protected]
>>> Subject: Re: Content Repository Cleanup
>>>
>>> This just filled up again, even with nifi.content.repository.archive.enabled=false.
>>>
>>> On the node that is still alive, our queued flowfiles are 91 / 16.47 GB, but the content repository directory is using 646 GB.
>>>
>>> Is there a property I can set to make it clean things up more frequently? I expected that once I turned archiving off, it would delete things from the content repository as soon as the flow files weren't queued anywhere. So far, the only way I have found to reliably get nifi to clear out the content repository is to restart it.
>>>
>>> Our version string is the following, if that interests you:
>>> 11/26/2016 04:39:37 PST
>>> Tagged nifi-1.1.0-RC2
>>> From ${buildRevision} on branch ${buildBranch}
>>>
>>> Maybe we will go to the released 1.1 and see if that helps. Until then I'll be restarting a lot and digging into the code to figure out where this cleanup is supposed to happen. Any pointers on code/configs for that would be appreciated.
>>>
>>> Thanks,
>>> Alan
>>>
>>> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <[email protected]> wrote:
>>> >
>>> > No, in my scenario a server restart would not affect the content repository size.
>>> >
>>> > On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <[email protected]> wrote:
>>> > >
>>> > > If we were in the situation Joe G described, should we expect that killing and restarting nifi would clean everything up? That behavior has been consistent every time: when the disk hits 100%, we kill nifi, delete enough old content files to bring it back up, and before it brings the UI up it deletes things to get back within the archive policy. That sounds less like the files are stuck and more like it failed trying.
>>> > >
>>> > > For now I just turned off archiving, since we don't really need it for this use case.
>>> > >
>>> > > I attached a jstack from last night's failure, which looks pretty boring to me.
>>> > >
>>> > > On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <[email protected]> wrote:
>>> > >
>>> > >> The scenario Joe G describes is almost exactly what we are doing. We bring in large files and unpack them into many smaller ones. In the most recent iteration of this problem, I saw that we had many small files queued up at the time the trouble was happening. We will try your suggestion to see if the situation improves.
>>> > >>
>>> > >> Thanks,
>>> > >> Alan
>>> > >>
>>> > >> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <[email protected]> wrote:
>>> > >>
>>> > >>> Not sure if your scenario is related, but one of the NiFi devs recently explained to me that files in the content repository are actually appended together with other flow file content (please correct me if I'm explaining it wrong). That means if you have many small flow files in your current backlog, and several large flow files have recently left the flow, the large ones could still be hanging around in the content repository as long as the small ones are still there, if they're in the same appended files on disk.
>>> > >>>
>>> > >>> This scenario recently happened to us: we had a flow with ~20 million tiny flow files queued up, and at the same time we were also processing a bunch of 1GB files, which left the flow quickly. The content repository was much larger than what was actually being reported in the flow stats, and our disks were almost full. On a hunch, I tried the following strategy:
>>> > >>> - MergeContent the tiny flow files using flow-file-v3 format (to capture all attributes)
>>> > >>> - MergeContent 10,000 of the packaged flow files using tar format for easier storage on disk
>>> > >>> - PutFile into a directory
>>> > >>> - GetFile from the same directory, but using back pressure from here on out (so that the flow simply wouldn't pull the same files from disk until it was really ready for them)
>>> > >>> - UnpackContent (untar them)
>>> > >>> - UnpackContent (turn them back into flow files with the original attributes)
>>> > >>> - Then do the processing they were originally designed for
>>> > >>>
>>> > >>> This very quickly reduced the size of my content repository to very nearly the actual size reported in the flow, and my disk usage dropped from ~95% to 50%, which is the configured content repository max usage percentage. I haven't had any problems since.
>>> > >>>
>>> > >>> Hope this helps.
>>> > >>> Joe
>>> > >>>
>>> > >>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <[email protected]> wrote:
>>> > >>>
>>> > >>> > Alan,
>>> > >>> >
>>> > >>> > That retention percentage only has to do with the archive of data, which kicks in once a given chunk of content is no longer reachable by active flowfiles in the flow. For it to grow to 100% typically means that you have data backlogged in the flow that accounts for that much space. If that is certainly not the case for you, then we need to dig deeper. If you could share screenshots, log files, and stack dumps from around this time, those would all be helpful. If the screenshots and such are too sensitive, please just share as much as you can.
>>> > >>> >
>>> > >>> > Thanks
>>> > >>> > Joe
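
To picture the append behavior Joe describes, here is a toy illustration (hypothetical sizes and class names, not NiFi code): one claim file holding a large payload whose flowfile already left the flow, plus a few tiny flowfiles that are still queued, cannot be reclaimed until the tiny ones leave, so the reported queue size and the on-disk size diverge:

import java.util.ArrayList;
import java.util.List;

// Toy model: several flowfiles' content is appended into one claim file, and
// the file is only reclaimable once *no* live flowfile references any part of it.
public class ClaimPinningDemo {
    static class ClaimFile {
        final List<Long> liveContentSizes = new ArrayList<>();
        long totalBytesOnDisk;
    }

    public static void main(String[] args) {
        ClaimFile claim = new ClaimFile();
        // A 1 GB payload that already left the flow, appended together with
        // three 1 KB flowfiles that are still queued (hypothetical sizes).
        claim.totalBytesOnDisk = 1_000_000_000L + 3 * 1_000L;
        for (int i = 0; i < 3; i++) {
            claim.liveContentSizes.add(1_000L);
        }

        long queuedBytes = claim.liveContentSizes.stream().mapToLong(Long::longValue).sum();
        // The whole file stays on disk as long as any live reference remains.
        long diskBytes = claim.liveContentSizes.isEmpty() ? 0 : claim.totalBytesOnDisk;

        System.out.printf("queue reports %d bytes; disk holds %d bytes%n", queuedBytes, diskBytes);
    }
}

With millions of tiny flowfiles spread across claim files, that gap is how a ~16 GB queue can pin hundreds of gigabytes on disk.
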
>>> > >>> >
>>> > >>> > On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > > One other note on this: when it came back up, there were tons of messages like this:
>>> > >>> > >
>>> > >>> > > 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file /path/to/content_repository/498/1481329796415-87538 (1071114 bytes) in File System Repository; archiving file
>>> > >>> > >
>>> > >>> > > I haven't dug into what that means.
>>> > >>> > > Alan
>>> > >>> > >
>>> > >>> > > On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > >> Hello,
>>> > >>> > >>
>>> > >>> > >> We have a node on which the nifi content repository keeps growing to use 100% of the disk. It's a relatively high-volume process. It chewed through more than 100GB in the three hours between when we first saw it hit 100% of the disk and when we just cleaned it up again.
>>> > >>> > >>
>>> > >>> > >> We are running nifi 1.1 for this. Our nifi.properties looked like this:
>>> > >>> > >>
>>> > >>> > >> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>> > >>> > >> nifi.content.claim.max.appendable.size=10 MB
>>> > >>> > >> nifi.content.claim.max.flow.files=100
>>> > >>> > >> nifi.content.repository.directory.default=./content_repository
>>> > >>> > >> nifi.content.repository.archive.max.retention.period=12 hours
>>> > >>> > >> nifi.content.repository.archive.max.usage.percentage=50%
>>> > >>> > >> nifi.content.repository.archive.enabled=true
>>> > >>> > >> nifi.content.repository.always.sync=false
>>> > >>> > >>
>>> > >>> > >> I just bumped the retention period down to 2 hours, but shouldn't the max usage percentage protect us from using 100% of the disk?
>>> > >>> > >>
>>> > >>> > >> Unfortunately, we didn't get jstacks on either failure. If it hits 100% again, I will make sure to get one.
>>> > >>> > >>
>>> > >>> > >> Thanks,
>>> > >>> > >> Alan
>>>
>>> <thread-dump.txt>
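
P.S. While we keep debugging, here is a quick standalone check I put together (my own sketch, not a NiFi utility; I'm assuming the archive threshold is judged against the total space of the volume holding the repository) to compare the content repository's size with that 50% setting:

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Compare the content repository's on-disk size against the
// nifi.content.repository.archive.max.usage.percentage threshold.
public class RepoUsageCheck {
    public static void main(String[] args) throws IOException {
        Path repo = Paths.get(args.length > 0 ? args[0] : "./content_repository");
        double maxUsagePercentage = 50.0; // from nifi.properties

        long repoBytes = sizeOf(repo);
        FileStore store = Files.getFileStore(repo);
        double usedPct = 100.0 * repoBytes / store.getTotalSpace();

        System.out.printf("content_repository: %,d bytes = %.1f%% of volume (threshold %.1f%%)%n",
                repoBytes, usedPct, maxUsagePercentage);
    }

    private static long sizeOf(Path dir) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total[0] += attrs.size();
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }
}

If this reports well over 50% while the flow only shows ~16 GB queued, that is more evidence that the cleanup handoff, not the archive policy, is what's broken.
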
