Alan,

Some things that I would look into:

* Did you see the same ERROR as the one below: "Unable to checkpoint FlowFile
Repository due to ..."?
  If so, did it occur before you ran out of disk space, or only after?

* What do you get when you run "ulimit -Hn" and "ulimit -Sn" on the command 
line?

* If you pick a few of the files from your content repo, do you see any log
  entries for those files indicating that the resource claim was marked
  destructible? For example, if you find a file in the content repo such as
  $CONTENT_REPO/482/1481752353343-482, do you see a log entry such as
  "Marking claim StandardResourceClaim[id=1481752353343-482 <... some other
  stuff here...>] as destructable"?
  Do you see any logs at all for that filename? E.g., grep "1481752353343-482"
  logs/nifi-app*

This would help us narrow down whether the claims are available to be cleaned up
but the cleanup isn't happening, or whether something is preventing the claims
from being marked available for cleanup in the first place.

Thanks!
-Mark


> On Dec 14, 2016, at 6:59 PM, Alan Jackoway <al...@cloudera.com> wrote:
> 
> Some updates:
> * We fixed the issue with missing transfer relationships, and this did not
> go away.
> * We saw this a few minutes ago when the queue was at 0.
> 
> What should I be looking for in the logs to figure out the issue?
> 
> Thanks,
> Alan
> 
> On Mon, Dec 12, 2016 at 12:45 PM, Alan Jackoway <al...@cloudera.com> wrote:
> 
>> In case this is interesting, I think this started getting bad when we
>> started hitting an error where some of our files were not given a transfer
>> relationship. Maybe some combination of not giving flow files a
>> relationship and the subsequent penalization is causing the problem.
>> 
>> On Mon, Dec 12, 2016 at 12:16 PM, Alan Jackoway <al...@cloudera.com>
>> wrote:
>> 
>>> Everything is at the default locations for these nifis.
>>> 
>>> On one of the two machines, I did find log messages like you suggested:
>>> 2016-12-11 08:00:59,389 ERROR [pool-10-thread-1]
>>> o.a.n.c.r.WriteAheadFlowFileRepository Unable to checkpoint FlowFile
>>> Repository due to java.io.FileNotFoundException:
>>> ./flowfile_repository/partition-14/3169.journal (No space left on device)
>>> 
>>> I added the logger, which apparently takes effect right away. What am I
>>> looking for in these logs? I see a lot of stuff like:
>>> 2016-12-12 07:19:03,560 DEBUG [Timer-Driven Process Thread-24]
>>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for
>>> StandardResourceClaim[id=1481555893660-3174, container=default,
>>> section=102] to 0
>>> 2016-12-12 07:19:03,561 DEBUG [Timer-Driven Process Thread-31]
>>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for
>>> StandardResourceClaim[id=1481555922818-3275, container=default,
>>> section=203] to 191
>>> 2016-12-12 07:19:03,605 DEBUG [Timer-Driven Process Thread-8]
>>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for
>>> StandardResourceClaim[id=1481555880393-3151, container=default,
>>> section=79] to 142
>>> 2016-12-12 07:19:03,624 DEBUG [Timer-Driven Process Thread-38]
>>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for
>>> StandardResourceClaim[id=1481555872053-3146, container=default,
>>> section=74] to 441
>>> 2016-12-12 07:19:03,625 DEBUG [Timer-Driven Process Thread-25]
>>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for
>>> StandardResourceClaim[id=1481555893954-3178, container=default,
>>> section=106] to 2
>>> 2016-12-12 07:19:03,647 DEBUG [Timer-Driven Process Thread-24]
>>> o.a.n.c.r.c.StandardResourceClaimManager Decrementing claimant count for
>>> StandardResourceClaim[id=1481555893696-3175, container=default,
>>> section=103] to 1
>>> 2016-12-12 07:19:03,705 DEBUG [FileSystemRepository Workers Thread-1]
>>> o.a.n.c.r.c.StandardResourceClaimManager Drained 0 destructable claims
>>> to []
>>> 
>>> What's puzzling to me is that both of these machines have > 100GB of free
>>> space, and I have never seen the queued size go above 20GB. It seems to me
>>> like it gets into a state where nothing is deleted long before it runs out
>>> of disk space.
>>> 
>>> Thanks,
>>> Alan
>>> 
>>> On Mon, Dec 12, 2016 at 9:13 AM, Mark Payne <marka...@hotmail.com> wrote:
>>> 
>>>> Alan,
>>>> 
>>>> Thanks for the thread-dump and the in-depth analysis!
>>>> 
>>>> So in terms of the two tasks there, here's a quick explanation of what
>>>> each does:
>>>> ArchiveOrDestroyDestructableClaims - When a Resource Claim (which maps
>>>> to a file on disk) is no longer referenced
>>>> by any FlowFile, it can be either archived or destroyed (depending on
>>>> whether the property in nifi.properties has archiving
>>>> enabled).
>>>> DestroyExpiredArchiveClaims - When archiving is enabled, the Resource
>>>> Claims that are archived have to eventually
>>>> age off. This task is responsible for ensuring that this happens.
>>>> 
>>>> As you mentioned, in the Executor, if the Runnable fails it will stop
>>>> running forever, and if the thread gets stuck, another will
>>>> not be launched. Neither of these appears to be the case. I say this
>>>> because both of those Runnables are wrapped entirely
>>>> within a try { ... } catch (Throwable t) {...}. So the method will never
>>>> end exceptionally. Also, the thread dump shows all of the
>>>> threads created by that Thread Pool (those whose names begin with
>>>> "FileSystemRepository Workers Thread-") in WAITING
>>>> or TIMED_WAITING state. This means that they are sitting in the Executor
>>>> waiting to be scheduled to do something else,
>>>> so they aren't stuck in any kind of infinite loop or anything like that.
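>>>> 
>>>> As a rough illustration of the pattern (a simplified, self-contained sketch, not
>>>> the actual FileSystemRepository code), catching Throwable inside the Runnable is
>>>> what keeps scheduleWithFixedDelay rescheduling it even when individual runs fail:
>>>> 
>>>> import java.util.concurrent.Executors;
>>>> import java.util.concurrent.ScheduledExecutorService;
>>>> import java.util.concurrent.TimeUnit;
>>>> 
>>>> public class SwallowingTaskDemo {
>>>>     public static void main(String[] args) throws InterruptedException {
>>>>         ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
>>>>         exec.scheduleWithFixedDelay(() -> {
>>>>             try {
>>>>                 // stand-in for the real work (e.g. archiving/destroying claims)
>>>>                 throw new RuntimeException("simulated failure");
>>>>             } catch (final Throwable t) {
>>>>                 // the Throwable is caught, so the Runnable returns normally and
>>>>                 // the executor schedules the next run as usual
>>>>                 System.out.println("run failed but task stays scheduled: " + t.getMessage());
>>>>             }
>>>>         }, 0, 100, TimeUnit.MILLISECONDS);
>>>>         Thread.sleep(500);   // the failure message keeps printing; the task never dies
>>>>         exec.shutdown();
>>>>     }
>>>> }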
>>>> 
>>>> Now, with all of that being said, I have a theory as to what could
>>>> perhaps be happening :)
>>>> 
>>>> From the configuration that you listed below, it shows that the content
>>>> repository is located at ./content_repository, which is
>>>> the default. Is the FlowFile Repository also located at the default
>>>> location of ./flowfile_repository? The reason that I ask is this:
>>>> 
>>>> When I said above that a Resource Claim is marked destructible when no
>>>> more FlowFiles reference it, that was a bit of a
>>>> simplification. A more detailed explanation is this: when the FlowFile
>>>> Repository is checkpointed (this happens every 2 minutes
>>>> by default), its Write-Ahead Log is "rolled over" (or "checkpointed" or
>>>> "compacted" or however you like to refer to it). When this
>>>> happens, we do an fsync() to ensure that the data is stored safely on
>>>> disk. Only then do we actually mark a claim as destructible.
>>>> This is done in order to ensure that if there is a power outage and a
>>>> FlowFile Repository update wasn't completely flushed to disk,
>>>> we can recover. For instance, if the content of a FlowFile changes
>>>> from Resource Claim A to Resource Claim B and as a result
>>>> we delete Resource Claim A and then lose power, it's possible that the
>>>> FlowFile Repository didn't flush that update to disk; as a result,
>>>> on restart, we may still have that FlowFile pointing to Resource Claim A
>>>> which is now deleted, so we would end up having data loss.
>>>> This method of only deleting Resource Claims after the FlowFile
>>>> Repository has been fsync'ed means that we know on restart that
>>>> Resource Claim A won't still be referenced.
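>>>> 
>>>> As pseudo-code (just a sketch of the ordering described above, with made-up
>>>> method names, not the actual repository implementation), the sequence looks
>>>> roughly like this:
>>>> 
>>>> import java.util.Arrays;
>>>> import java.util.List;
>>>> 
>>>> public class CheckpointOrderingSketch {
>>>> 
>>>>     void checkpoint(List<String> claimsNoLongerReferenced) {
>>>>         rolloverWriteAheadLog();        // compact / "roll over" the journal
>>>>         fsyncFlowFileRepository();      // make the checkpoint durable on disk
>>>>         // Only now is it safe to let the content repository delete these files;
>>>>         // if we crashed before the fsync, a FlowFile could still point at them.
>>>>         claimsNoLongerReferenced.forEach(this::markDestructible);
>>>>     }
>>>> 
>>>>     void rolloverWriteAheadLog()        { System.out.println("checkpointing write-ahead log"); }
>>>>     void fsyncFlowFileRepository()      { System.out.println("fsync()"); }
>>>>     void markDestructible(String claim) { System.out.println(claim + " -> destructible"); }
>>>> 
>>>>     public static void main(String[] args) {
>>>>         new CheckpointOrderingSketch().checkpoint(Arrays.asList("1481555893660-3174"));
>>>>     }
>>>> }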
>>>> 
>>>> So that was probably a very wordy, verbose description of what happens
>>>> but I'm trying to make sure that I explain things adequately.
>>>> So with that background... if you are storing your FlowFile Repository
>>>> on the same volume as your Content Repository, the following
>>>> could happen:
>>>> 
>>>> At some point in time, enough data is queued up in your flow for you to
>>>> run out of disk space. As a result, the FlowFile Repository is
>>>> unable to be compacted. Since this is not happening, it will not mark
>>>> any of the Resource Claims as destructible. This would mean that
>>>> the Content Repository does not get cleaned up. So now you've got a full
>>>> Content Repository and it's unable to clean up after itself, because
>>>> no Resource Claims are getting marked as destructible.
>>>> 
>>>> So to prove or disprove this theory, there are a few things that you can
>>>> look at:
>>>> 
>>>> Do you see the following anywhere in your logs: Unable to checkpoint
>>>> FlowFile Repository
>>>> 
>>>> If you add the following to your conf/logback.xml:
>>>> <logger name="org.apache.nifi.controller.repository.claim.StandardResourceClaimManager" level="DEBUG" />
>>>> Then that should allow you to see a DEBUG-level log message every time
>>>> that a Resource Claim is marked destructible and every time
>>>> that the Content Repository requests the collection of Destructible
>>>> Claims ("Drained 100 destructable claims" for instance)
>>>> 
>>>> Any of the logs related to those statements should be very valuable in
>>>> determining what's going on.
>>>> 
>>>> Thanks again for all of the detailed analysis. Hopefully we can get this
>>>> all squared away and taken care of quickly!
>>>> 
>>>> -Mark
>>>> 
>>>> 
>>>> On Dec 11, 2016, at 1:21 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>>> 
>>>> Here is what I have figured out so far.
>>>> 
>>>> The cleanups are scheduled at
>>>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/repository/FileSystemRepository.java#L232
>>>> 
>>>> I'm not totally sure which one of those is the one that should be
>>>> cleaning things up. It's either ArchiveOrDestroyDestructableClaims or
>>>> DestroyExpiredArchiveClaims, both of which are in that class, and both of
>>>> which are scheduled with scheduleWithFixedDelay. Based on the docs at
>>>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit),
>>>> if those tasks fail once, they will stop running forever. Also, if the thread
>>>> got stuck, it wouldn't launch a new one.
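>>>> 
>>>> To make that behavior concrete, here is a tiny stand-alone sketch (not NiFi
>>>> code, just ScheduledThreadPoolExecutor's documented behavior): once a task
>>>> scheduled with scheduleWithFixedDelay throws, it is never run again.
>>>> 
>>>> import java.util.concurrent.Executors;
>>>> import java.util.concurrent.ScheduledExecutorService;
>>>> import java.util.concurrent.TimeUnit;
>>>> import java.util.concurrent.atomic.AtomicInteger;
>>>> 
>>>> public class FixedDelayFailureDemo {
>>>>     public static void main(String[] args) throws InterruptedException {
>>>>         ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
>>>>         AtomicInteger runs = new AtomicInteger();
>>>>         exec.scheduleWithFixedDelay(() -> {
>>>>             int n = runs.incrementAndGet();
>>>>             System.out.println("run " + n);
>>>>             if (n == 3) {
>>>>                 // one uncaught exception and the executor silently suppresses
>>>>                 // all subsequent executions of this task
>>>>                 throw new RuntimeException("boom");
>>>>             }
>>>>         }, 0, 100, TimeUnit.MILLISECONDS);
>>>>         Thread.sleep(1000);   // prints "run 1" through "run 3", then nothing
>>>>         exec.shutdown();
>>>>     }
>>>> }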
>>>> 
>>>> I then hoped I would go into the logs, see a failure, and use it to
>>>> figure out the issue.
>>>> 
>>>> What I'm seeing instead is things like this, which comes from
>>>> BinDestructableClaims:
>>>> 2016-12-10 23:08:50,117 INFO [Cleanup Archive for default]
>>>> o.a.n.c.repository.FileSystemRepository Deleted 159 files from archive
>>>> for Container default; oldest Archive Date is now Sat Dec 10 22:09:53 PST
>>>> 2016; container cleanup took 34266 millis
>>>> that are somewhat frequent (as often as once per second, which is the
>>>> scheduling frequency). Then, eventually, they just stop. Unfortunately
>>>> there isn't an error message I can find that's killing these.
>>>> 
>>>> At nifi startup, I see messages like this, which come from something
>>>> (not sure what yet) calling the cleanup() method on FileSystemRepository:
>>>> 2016-12-11 09:15:38,973 INFO [main] o.a.n.c.repository.FileSystemRepository
>>>> Found unknown file /home/cops/edh-bundle-extractor/content_repository/0/1481467667784-2048
>>>> (1749645 bytes) in File System Repository; removing file
>>>> I never see those after the initial cleanup that happens on restart.
>>>> 
>>>> I attached a thread dump. I noticed at the top that there is a cleanup
>>>> thread parked. I took 10 more thread dumps after this and in every one of
>>>> them the cleanup thread was parked. That thread looks like it corresponds
>>>> to DestroyExpiredArchiveClaims, so I think it's incidental. I believe that
>>>> if the cleanup task I need were running, it would be in one of the
>>>> FileSystemRepository Workers. However, in all of my thread dumps, these
>>>> were always all parked.
>>>> 
>>>> Attached one of the thread dumps.
>>>> 
>>>> Thanks,
>>>> Alan
>>>> 
>>>> 
>>>> On Sun, Dec 11, 2016 at 12:17 PM, Mark Payne <marka...@hotmail.com> wrote:
>>>> Alan,
>>>> 
>>>> 
>>>> It's possible that you've run into some sort of bug that is preventing
>>>> 
>>>> it from cleaning up the Content  Repository properly. While it's stuck
>>>> 
>>>> in this state, could you capture a thread dump (bin/nifi.sh dump
>>>> thread-dump.txt)?
>>>> 
>>>> That would help us determine if there is something going on that is
>>>> 
>>>> preventing the cleanup from happening.
>>>> 
>>>> 
>>>> Thanks
>>>> 
>>>> -Mark
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Alan Jackoway <al...@cloudera.com>
>>>> Sent: Sunday, December 11, 2016 11:11 AM
>>>> To: dev@nifi.apache.org
>>>> Subject: Re: Content Repository Cleanup
>>>> 
>>>> This just filled up again even
>>>> with nifi.content.repository.archive.enabled=false.
>>>> 
>>>> On the node that is still alive, our queued flowfiles are 91 / 16.47 GB,
>>>> but the content repository directory is using 646 GB.
>>>> 
>>>> Is there a property I can set to make it clean things up more
>>>> frequently? I
>>>> expected that once I turned archive enabled off, it would delete things
>>>> from the content repository as soon as the flow files weren't queued
>>>> anywhere. So far the only way I have found to reliably get nifi to clear
>>>> out the content repository is to restart it.
>>>> 
>>>> Our version string is the following, if that interests you:
>>>> 11/26/2016 04:39:37 PST
>>>> Tagged nifi-1.1.0-RC2
>>>> From ${buildRevision} on branch ${buildBranch}
>>>> 
>>>> Maybe we will go to the released 1.1 and see if that helps. Until then
>>>> I'll
>>>> be restarting a lot and digging into the code to figure out where this
>>>> cleanup is supposed to happen. Any pointers on code/configs for that
>>>> would
>>>> be appreciated.
>>>> 
>>>> Thanks,
>>>> Alan
>>>> 
>>>> On Sun, Dec 11, 2016 at 8:51 AM, Joe Gresock <jgres...@gmail.com> wrote:
>>>> 
>>>>> No, in my scenario a server restart would not affect the content
>>>> repository
>>>>> size.
>>>>> 
>>>>> On Sun, Dec 11, 2016 at 8:46 AM, Alan Jackoway <al...@cloudera.com> wrote:
>>>>> 
>>>>>> If we were in the situation Joe G described, should we expect that
>>>> when
>>>>> we
>>>>>> kill and restart nifi it would clean everything up? That behavior has
>>>>> been
>>>>>> consistent every time - when the disk hits 100%, we kill nifi, delete
>>>>>> enough old content files to bring it back up, and before it brings
>>>> the UI
>>>>> up
>>>>>> it deletes things to get within the archive policy again. That sounds
>>>>> less
>>>>>> like the files are stuck and more like it failed trying.
>>>>>> 
>>>>>> For now I just turned off archiving, since we don't really need it
>>>> for
>>>>>> this use case.
>>>>>> 
>>>>>> I attached a jstack from last night's failure, which looks pretty
>>>> boring
>>>>>> to me.
>>>>>> 
>>>>>> On Sun, Dec 11, 2016 at 1:37 AM, Alan Jackoway <al...@cloudera.com> wrote:
>>>>>> 
>>>>>>> The scenario Joe G describes is almost exactly what we are doing. We
>>>>>>> bring in large files and unpack them into many smaller ones. In the
>>>> most
>>>>>>> recent iteration of this problem, I saw that we had many small files
>>>>> queued
>>>>>>> up at the time trouble was happening. We will try your suggestion to
>>>>> see if
>>>>>>> the situation improves.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Alan
>>>>>>> 
>>>>>>> On Sat, Dec 10, 2016 at 6:57 AM, Joe Gresock <jgres...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Not sure if your scenario is related, but one of the NiFi devs
>>>> recently
>>>>>>>> explained to me that the files in the content repository are
>>>> actually
>>>>>>>> appended together with other flow file content (please correct me
>>>> if
>>>>> I'm
>>>>>>>> explaining it wrong).  That means if you have many small flow
>>>> files in
>>>>>>>> your
>>>>>>>> current backlog, and several large flow files have recently left
>>>> the
>>>>>>>> flow,
>>>>>>>> the large ones could still be hanging around in the content
>>>> repository
>>>>> as
>>>>>>>> long as the small ones are still there, if they're in the same
>>>> appended
>>>>>>>> files on disk.
>>>>>>>> 
>>>>>>>> This scenario recently happened to us: we had a flow with ~20
>>>> million
>>>>>>>> tiny
>>>>>>>> flow files queued up, and at the same time we were also processing
>>>> a
>>>>>>>> bunch
>>>>>>>> of 1GB files, which left the flow quickly.  The content repository
>>>> was
>>>>>>>> much
>>>>>>>> larger than what was actually being reported in the flow stats,
>>>> and our
>>>>>>>> disks were almost full.  On a hunch, I tried the following
>>>> strategy:
>>>>>>>> - MergeContent the tiny flow files using flow-file-v3 format (to
>>>>> capture
>>>>>>>> all attributes)
>>>>>>>> - MergeContent 10,000 of the packaged flow files using tar format
>>>> for
>>>>>>>> easier storage on disk
>>>>>>>> - PutFile into a directory
>>>>>>>> - GetFile from the same directory, but using back pressure from
>>>> here on
>>>>>>>> out
>>>>>>>> (so that the flow simply wouldn't pull the same files from disk
>>>> until
>>>>> it
>>>>>>>> was really ready for them)
>>>>>>>> - UnpackContent (untar them)
>>>>>>>> - UnpackContent (turn them back into flow files with the original
>>>>>>>> attributes)
>>>>>>>> - Then do the processing they were originally designed for
>>>>>>>> 
>>>>>>>> This had the effect of very quickly reducing the size of my content
>>>>>>>> repository to very nearly the actual size I saw reported in the
>>>> flow,
>>>>> and
>>>>>>>> my disk usage dropped from ~95% to 50%, which is the configured
>>>> content
>>>>>>>> repository max usage percentage.  I haven't had any problems since.
>>>>>>>> 
>>>>>>>> Hope this helps.
>>>>>>>> Joe
>>>>>>>> 
>>>>>>>> On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Alan,
>>>>>>>>> 
>>>>>>>>> That retention percentage only has to do with the archive of data
>>>>>>>>> which kicks in once a given chunk of content is no longer
>>>> reachable
>>>>> by
>>>>>>>>> active flowfiles in the flow.  For it to grow to 100% typically
>>>> would
>>>>>>>>> mean that you have data backlogged in the flow that accounts for
>>>> that
>>>>>>>>> much space.  If that is certainly not the case for you then we
>>>> need
>>>>> to
>>>>>>>>> dig deeper.  If you could do screenshots or share log files and
>>>> stack
>>>>>>>>> dumps around this time those would all be helpful.  If the
>>>>> screenshots
>>>>>>>>> and such are too sensitive please just share as much as you can.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Joe
>>>>>>>>> 
>>>>>>>>> On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>>>>>>>>> One other note on this, when it came back up there were tons of
>>>>>>>> messages
>>>>>>>>>> like this:
>>>>>>>>>> 
>>>>>>>>>> 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository
>>>>>>>>>> Found unknown file /path/to/content_repository/498/1481329796415-87538
>>>>>>>>>> (1071114 bytes) in File System Repository; archiving file
>>>>>>>>>> 
>>>>>>>>>> I haven't dug into what that means.
>>>>>>>>>> Alan
>>>>>>>>>> 
>>>>>>>>>> On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <al...@cloudera.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>> We have a node on which the nifi content repository keeps growing
>>>> to
>>>>> use
>>>>>>>>> 100%
>>>>>>>>>>> of the disk. It's a relatively high-volume process. It chewed
>>>>>>>> through
>>>>>>>>> more
>>>>>>>>>>> than 100GB in the three hours between when we first saw it hit
>>>>> 100%
>>>>>>>> of
>>>>>>>>> the
>>>>>>>>>>> disk and when we just cleaned it up again.
>>>>>>>>>>> 
>>>>>>>>>>> We are running nifi 1.1 for this. Our nifi.properties looked
>>>> like
>>>>>>>> this:
>>>>>>>>>>> 
>>>>>>>>>>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>>>>>>>>>> nifi.content.claim.max.appendable.size=10 MB
>>>>>>>>>>> nifi.content.claim.max.flow.files=100
>>>>>>>>>>> nifi.content.repository.directory.default=./content_repository
>>>>>>>>>>> nifi.content.repository.archive.max.retention.period=12 hours
>>>>>>>>>>> nifi.content.repository.archive.max.usage.percentage=50%
>>>>>>>>>>> nifi.content.repository.archive.enabled=true
>>>>>>>>>>> nifi.content.repository.always.sync=false
>>>>>>>>>>> 
>>>>>>>>>>> I just bumped retention period down to 2 hours, but should max
>>>>> usage
>>>>>>>>>>> percentage protect us from using 100% of the disk?
>>>>>>>>>>> 
>>>>>>>>>>> Unfortunately we didn't get jstacks on either failure. If it
>>>> hits
>>>>>>>> 100%
>>>>>>>>>>> again I will make sure to get that.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alan
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> I know what it is to be in need, and I know what it is to have
>>>>> plenty.  I
>>>>>>>> have learned the secret of being content in any and every
>>>> situation,
>>>>>>>> whether well fed or hungry, whether living in plenty or in want.
>>>> I can
>>>>>>>> do
>>>>>>>> all this through him who gives me strength.    *-Philippians
>>>> 4:12-13*
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> I know what it is to be in need, and I know what it is to have
>>>> plenty.  I
>>>>> have learned the secret of being content in any and every situation,
>>>>> whether well fed or hungry, whether living in plenty or in want.  I
>>>> can do
>>>>> all this through him who gives me strength.    *-Philippians 4:12-13*
>>>>> 
>>>> 
>>>> <thread-dump.txt>
>>>> 
>>>> 
>>> 
>> 
