Good instinct -- here's what I get:

nifi-app.log:2017-03-09 15:03:00,670 INFO [Distributed Cache Server Communications Thread: ac907dec-49a4-439e-99f5-1558f2358d87] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@40569408 checkpointed with *4262902* Records and 0 Swap Files in 256302 milliseconds (Stop-the-world time = 1378 milliseconds, Clear Edit Logs time = 19 millis), max Transaction ID 4263237
Looks like it's over 4.2 million records now.

On Thu, Mar 9, 2017 at 3:13 PM, Mark Payne <[email protected]> wrote:

> Joe,
>
> That definitely sounds like a bug causing the eviction to not happen. Can
> you grep your logs for the phrase "checkpointed with"? You should have a
> line that tells you how many records were written to the Snapshot. You
> will certainly see a few of these types of messages, though, because you
> have one for the FlowFile Repository, one for Local State Management, and
> another one for the DistributedMapCacheServer. I am curious to see if you
> see the log message indicating 3 million+ records also.
>
> Thanks
> -Mark
>
> > On Mar 8, 2017, at 7:13 PM, Joe Gresock <[email protected]> wrote:
> >
> > Looking through the PersistentMapCache and SimpleMapCache, it seems like
> > lots of these records should have been evicted by now. We're up to 3.1
> > million records on disk in the snapshot file. My understanding is that
> > when wali.checkpoint() is called, it collapses all the DELETE records in
> > the journaled log and removes them before writing the snapshot file. Is
> > that accurate?
> >
> > I feel like something is not going quite right with the eviction
> > process. I am using 1.1.1, though, and I have noticed that the
> > PersistentMapCache has changed in [1], so I might apply that patch and
> > try some more experiments.
> >
> > Would anyone be willing to try to replicate this behavior in NiFi 1.1.1?
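Joe's description of checkpointing above -- collapsing the DELETE records in the journal before the snapshot is written -- can be sketched as follows. This is a simplified model for illustration only; the record types and method names here are hypothetical and do not match the actual org.wali API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of journal replay during a checkpoint: a DELETE for a
// key removes any earlier PUT for that key, so deleted keys never reach
// the snapshot. (Hypothetical record model, not the real org.wali types.)
public class CheckpointSketch {

    enum Op { PUT, DELETE }

    record JournalRecord(Op op, String key, String value) {}

    // Replay the journal in order; only keys still live at the end of the
    // replay appear in the resulting snapshot map.
    static Map<String, String> checkpoint(List<JournalRecord> journal) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (JournalRecord r : journal) {
            if (r.op() == Op.PUT) {
                snapshot.put(r.key(), r.value());
            } else {
                snapshot.remove(r.key());
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        List<JournalRecord> journal = new ArrayList<>();
        journal.add(new JournalRecord(Op.PUT, "a", "1"));
        journal.add(new JournalRecord(Op.PUT, "b", "2"));
        journal.add(new JournalRecord(Op.DELETE, "a", null));
        // Only "b" survives: the DELETE for "a" collapsed its earlier PUT.
        System.out.println(checkpoint(journal).size()); // 1
    }
}
```

If this model matches the real behavior, a snapshot that keeps growing past the eviction limit would suggest the DELETE records are not being written (or not collapsed) as expected.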
> > You should be able to do it as follows:
> >
> > Services:
> > - DistributedMapCacheServer, maximum cache entries = 100,000, FIFO
> >   eviction, persistence directory specified
> > - DistributedMapCacheClientService, pointed at the same host and port
> >
> > Flow:
> > GenerateFlowFile (randomize 1K binary files in batches of 10, schedule
> > 10 threads) -> HashContent (md5) into hash.value -> DetectDuplicate with
> > identifier = ${hash.value}, description = ., no age off, select your
> > cache client, cache identifier = true
> >
> > This should cause the snapshot file to exceed 100,000 keys pretty
> > quickly, and as far as I can tell, it never goes back down. This in
> > itself is not a problem, but when the cache gets really big, it tends
> > to crash our cluster when NiFi reloads it into memory.
> >
> > [1] https://issues.apache.org/jira/browse/NIFI-3214
> >
> > On Wed, Mar 8, 2017 at 11:06 AM, Joe Gresock <[email protected]> wrote:
> >
> >> Thanks Bryan, I'll start looking through the PersistentMapCache. This
> >> morning I checked back and the snapshot file now has 2.9 million keys
> >> in it.
> >>
> >> On Tue, Mar 7, 2017 at 4:39 PM, Bryan Bende <[email protected]> wrote:
> >>
> >>> Joe,
> >>>
> >>> I'm not that familiar with the persistence part of the DMCS, although
> >>> I do know that it uses the write-ahead log that is also used by the
> >>> flow file repo.
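The FIFO eviction the flow above is meant to exercise can be sketched with a `LinkedHashMap` in insertion order. This is an illustration of the expected in-memory behavior only, not the actual SimpleMapCache implementation, which maintains its own eviction structures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of FIFO eviction at a fixed cap: once the cache holds maxEntries
// keys, each new insertion evicts the oldest-inserted key. Illustration
// only -- not the SimpleMapCache code.
public class FifoCacheSketch {

    static <K, V> Map<K, V> fifoCache(int maxEntries) {
        // accessOrder = false keeps insertion (FIFO) order, not LRU.
        return new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Boolean> cache = fifoCache(100_000);
        // Emulate 150,000 distinct identifiers being cached, as
        // DetectDuplicate would do with unique hash.value keys.
        for (int i = 0; i < 150_000; i++) {
            cache.put("key-" + i, Boolean.TRUE);
        }
        // The in-memory cache never exceeds the cap; it is the on-disk
        // snapshot that Joe observes growing past it.
        System.out.println(cache.size()); // 100000
    }
}
```

If eviction works as sketched, the in-memory map stays bounded, which is consistent with the suspicion that the unbounded growth is in the persisted snapshot rather than the live cache.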
> >>>
> >>> The code for PersistentMapCache is here:
> >>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-services/nifi-distributed-cache-services-bundle/nifi-distributed-cache-server/src/main/java/org/apache/nifi/distributed/cache/server/map/PersistentMapCache.java
> >>>
> >>> It looks like the WAL is check-pointed during puts here:
> >>>
> >>>     final long modCount = modifications.getAndIncrement();
> >>>     if (modCount > 0 && modCount % 100000 == 0) {
> >>>         wali.checkpoint();
> >>>     }
> >>>
> >>> And during deletes here:
> >>>
> >>>     final long modCount = modifications.getAndIncrement();
> >>>     if (modCount > 0 && modCount % 1000 == 0) {
> >>>         wali.checkpoint();
> >>>     }
> >>>
> >>> Not sure if it was intentional that put operations check point every
> >>> 100k and deletes check point every 1k.
> >>>
> >>> Maybe Mark or others could shed some light on why the snapshot is
> >>> reaching 3GB in size.
> >>>
> >>> -Bryan
> >>>
> >>> On Tue, Mar 7, 2017 at 7:07 AM, Joe Gresock <[email protected]> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> Is there a technical description of how the DistributedMapCacheServer
> >>>> (DMCS) persistence works? I've noticed the following on our cluster:
> >>>>
> >>>> - I have the DMCS configured on port 4557 as FIFO with max 100,000
> >>>>   entries, and have specified a persistence directory
> >>>> - I am using DetectDuplicate with the DMCS, and the individual key
> >>>>   length is 80 bytes, with a Description length of 1 byte. By my
> >>>>   count, this should result in a pure data size of 7.7MB.
> >>>> - I notice that the snapshot file in the persistence directory
> >>>>   appears to continue growing past the 100,000 limit, though this
> >>>>   may be expected depending on the implementation.
> >>>> Since I know that the key will contain "json" in it, I can run the
> >>>> following command to count the number of possible keys in the
> >>>> snapshot file (though I'm not sure if this is a good way of
> >>>> measuring how many keys are actually cached):
> >>>>
> >>>>     grep -oa json snapshot | wc -l
> >>>>
> >>>> - When the snapshot file reaches around 3GB, the DMCS has a hard
> >>>>   time staying up, and frequently becomes unreachable
> >>>>   (netstat -tulpn | grep 4557 shows nothing). At this point, in
> >>>>   order to restore functionality I delete the persistence directory
> >>>>   and let it start over.
> >>>>
> >>>> So my main questions are:
> >>>> - How are the snapshot and partition files structured, and how can I
> >>>>   estimate how many keys are actually cached at a given time?
> >>>> - Is the described behavior indicative of the cache exceeding the
> >>>>   configured max number of keys?
> >>>>
> >>>> Thanks,
> >>>> Joe
> >>>>
> >>>> --
> >>>> I know what it is to be in need, and I know what it is to have
> >>>> plenty. I have learned the secret of being content in any and every
> >>>> situation, whether well fed or hungry, whether living in plenty or
> >>>> in want. I can do all this through him who gives me strength.
> >>>> *-Philippians 4:12-13*
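Bryan's question earlier in the thread about the asymmetric checkpoint cadence (puts checkpoint every 100,000 modifications, deletes every 1,000, against a shared counter) can be illustrated with a quick count. This is a sketch of the quoted modulo logic only, not the PersistentMapCache code itself.

```java
// Sketch of the checkpoint cadence in the quoted PersistentMapCache
// snippet: one shared modification counter, with puts checkpointing every
// 100,000 mods and deletes every 1,000. Counting triggers for a workload
// of 100,000 puts followed by 100,000 deletes shows how much more often
// the delete path checkpoints.
public class CheckpointCadenceSketch {

    public static void main(String[] args) {
        long modCount = 0;
        int checkpoints = 0;
        // Put phase: counter values 0..99,999 -- no multiple of 100,000
        // above zero is reached, so no checkpoint fires.
        for (int i = 0; i < 100_000; i++) {
            long c = modCount++;
            if (c > 0 && c % 100_000 == 0) checkpoints++;   // put rule
        }
        // Delete phase: counter values 100,000..199,999 -- every multiple
        // of 1,000 fires, i.e. 100 checkpoints.
        for (int i = 0; i < 100_000; i++) {
            long c = modCount++;
            if (c > 0 && c % 1_000 == 0) checkpoints++;     // delete rule
        }
        System.out.println(checkpoints); // prints 100
    }
}
```

So under this workload every checkpoint is triggered by the delete rule, which is consistent with Bryan's suspicion that the hundredfold difference between the two thresholds may not have been intentional.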
