[ 
https://issues.apache.org/jira/browse/AMQ-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656923#comment-16656923
 ] 

Alan Protasio commented on AMQ-7080:
------------------------------------

I'm calculating a hash and saving it in the metadata (for now the hash is 
simply an XOR operation; maybe that is enough here).

[https://github.com/alanprot/activemq/commit/de3b1ad9927ed20449c10afa687056322869ce00]
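
To illustrate the XOR idea, here is a minimal sketch. It is not the code from the commit above: the class and method names are hypothetical, and it assumes the hash is taken over the ids of the free pages.

```java
// Hypothetical sketch, not the code from the linked commit.
// Assumption: the hash covers the ids of the pages in the free list.
final class FreeListChecksum {

    // XOR-folds every free page id into a single long.
    static long xorOf(long[] freePageIds) {
        long hash = 0L;
        for (long id : freePageIds) {
            hash ^= id;
        }
        return hash;
    }

    // True when the free list read from db.free matches the hash
    // that was stored in the index metadata.
    static boolean matches(long[] freePageIds, long storedHash) {
        return xorOf(freePageIds) == storedHash;
    }
}
```

XOR is cheap and order-independent, but ids can cancel each other out: the ids 1, 2 and 3 hash to 0, the same as an empty list. Whether that is acceptable here is the open question.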

I also did some optimization in the Marshaller to do fewer writes. This sped 
the operation up a lot and reduced the performance hit. This is still a WIP: 
no tests yet, because I'm still doing the optimization and measuring the 
performance hit. I will share the results when I have more concrete data.

I also created one more field in the index metadata (needsFreePageRecovery) to 
keep track of whether the recovery is needed. This is needed because the 
recovery is now done at shutdown and we are saving db.free at the checkpoint. 
Thus, I can have a db.free that does not represent all the free pages.

Imagine:

1 -> Free pages Size (1000)

2 -> Unclean Shutdown

3 -> db.free and db.data are out of sync

4 -> ActiveMQ starts and allocates new pages (it cannot reuse db.free)

5 -> ActiveMQ allocates 200 more pages

6 -> Unclean shutdown (db.free has 200 pages)

7 -> ActiveMQ starts and db.free and db.data ARE in sync (I can reuse the 200 
free pages of the last run)

8 -> Clean shutdown -> I should try to recover the free pages and rebuild the 
full list of 1200 free pages

We could recover on every unclean shutdown, but in most cases this will not 
be needed.
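
The scenario above can be replayed with a small model. This is only a sketch under assumptions: apart from needsFreePageRecovery, every name is hypothetical, page counts stand in for the real free list, and the 200 pages from step 5 are assumed to end up in db.free by step 6.

```java
// Hypothetical model of the scenario above. Only needsFreePageRecovery is a
// name from the comment; everything else is made up for illustration.
final class FreePageRecoveryModel {
    int pagesInDbFree;              // free pages currently listed in db.free
    int pagesMissingFromDbFree;     // free pages in db.data that db.free no longer lists
    boolean needsFreePageRecovery;  // persisted in the index metadata

    // Pages become free and are written to db.free at a checkpoint.
    void checkpointWithFreedPages(int count) {
        pagesInDbFree += count;
    }

    // Startup after an unclean shutdown.
    void startAfterUncleanShutdown(boolean dbFreeInSyncWithDbData) {
        if (!dbFreeInSyncWithDbData) {
            // db.free cannot be trusted, so its pages drop out of view
            // until a full scan of the index finds them again.
            pagesMissingFromDbFree += pagesInDbFree;
            pagesInDbFree = 0;
            needsFreePageRecovery = true;
        }
        // When in sync, db.free is reused as-is and the flag keeps its value.
    }

    // Clean shutdown: scan the whole index only if an earlier run left
    // free pages untracked.
    void cleanShutdown() {
        if (needsFreePageRecovery) {
            pagesInDbFree += pagesMissingFromDbFree;
            pagesMissingFromDbFree = 0;
            needsFreePageRecovery = false;
        }
    }
}
```

Without the flag, step 7 would look like a healthy start (db.free and db.data are in sync), and the 1000 pages lost at step 3 would never be found again.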

 

 

> Keep track of free pages - Update db.free file during checkpoints
> -----------------------------------------------------------------
>
>                 Key: AMQ-7080
>                 URL: https://issues.apache.org/jira/browse/AMQ-7080
>             Project: ActiveMQ
>          Issue Type: Improvement
>          Components: KahaDB
>    Affects Versions: 5.15.6
>            Reporter: Alan Protasio
>            Priority: Major
>
> In the event of an unclean shutdown, ActiveMQ loses the information about the 
> free pages in the index. In order to recover this information, ActiveMQ reads 
> the whole index during shutdown, searching for free pages, and then saves the 
> db.free file. This operation can take a long time, making the failover 
> slower (during the shutdown, ActiveMQ still holds the lock).
> From http://activemq.apache.org/shared-file-system-master-slave.html
> {quote}"If you have a SAN or shared file system it can be used to provide 
> high availability such that if a broker is killed, another broker can take 
> over immediately."
> {quote}
> It is important to note that if the shutdown takes more than 
> ACTIVEMQ_KILL_MAXSECONDS seconds, any following shutdown will be unclean. The 
> broker will stay in this state unless the index is deleted (this state means 
> that every failover will take more than ACTIVEMQ_KILL_MAXSECONDS, so if you 
> increase this time to 5 minutes, your failover can take more than 5 minutes).
>  
> In order to prevent ActiveMQ from reading the whole index file to search for 
> free pages, we can keep track of them on every checkpoint. In order to do 
> that, we need to be sure that db.data and db.free are in sync. To achieve 
> that, we can have an attribute in db.free that is referenced by db.data.
> So during the checkpoint we have:
> 1 - Save db.free and give a freePageUniqueId
> 2 - Save this freePageUniqueId in the db.data (metadata)
> After a crash, we can check whether db.data has the same freePageUniqueId as 
> db.free. If this is the case, we can safely use the free page information 
> contained in db.free.
> Now, the only case in which we have to read the whole index file again is if 
> the crash happens between steps 1 and 2 (which is very unlikely).
> The drawback of this implementation is that we will have to save db.free 
> during the checkpoint, which can possibly increase the checkpoint time.
> It is also important to note that we CAN (and should) have stale data in 
> db.free, as it is referencing stale db.data:
> Imagine the timeline:
> T0 -> P1, P2 and P3 are free.
> T1 -> Checkpoint
> T2 -> P1 got occupied.
> T3 -> Crash
> In the current scenario, after Pagefile#load P1 will be free, and then the 
> replay will either mark P1 as occupied or occupy another page (now that the 
> recovery of free pages is done on shutdown).
> This change only makes sure that db.data and db.free are in sync and reflect 
> the state at T1 (checkpoint). If they are in sync, we can trust db.free.
> This is a really quick draft of what I'm suggesting. If you guys agree, I 
> can create the proper patch afterwards:
> [https://github.com/alanprot/activemq/commit/18036ef7214ef0eaa25c8650f40644dd8b4632a5]
>  
> This is related to https://issues.apache.org/jira/browse/AMQ-6590
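
The two-step checkpoint from the quoted description could look roughly like this. It is a sketch only, not the code from either linked commit: freePageUniqueId is the only name taken from the issue, the ids are plain counters, and the two fields stand in for what would really be written to the db.free and db.data files.

```java
// Hypothetical sketch of the checkpoint handshake between db.free and db.data.
// Fields stand in for values that would be persisted in the two files.
final class CheckpointSketch {
    long freePageUniqueIdInDbFree = -1;  // id written together with db.free
    long freePageUniqueIdInDbData = -2;  // id written into the db.data metadata
    private long nextId = 0;

    // Step 1: save db.free and give it a fresh freePageUniqueId.
    void saveDbFree() {
        freePageUniqueIdInDbFree = ++nextId;
    }

    // Step 2: save that freePageUniqueId in the db.data metadata.
    void saveDbDataMetadata() {
        freePageUniqueIdInDbData = freePageUniqueIdInDbFree;
    }

    void checkpoint() {
        saveDbFree();
        saveDbDataMetadata();
    }

    // After a crash: db.free is usable only if both files carry the same id.
    boolean canTrustDbFree() {
        return freePageUniqueIdInDbFree == freePageUniqueIdInDbData;
    }
}
```

A crash between the two steps leaves the ids different, so canTrustDbFree() returns false and the full index scan is still required in that one case.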



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
