Hi Chen,

The LSN only needs to indicate the last operation that was performed on the flushed disk component, so we know where to begin recovery for that index. The flush operation is triggered only after the flush log record hits the disk, and while we wait for that to happen, the state of the mutable index component will be READABLE_UNWRITABLE. This prevents any new writes from going into the mutable component whose LSN would be greater than FLUSH_LSN (the LSN when the flush was requested, not when it completed).
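To make the protocol concrete, here is a toy sketch of the scheme described in this thread. All class and method names are invented for illustration; they do not correspond to AsterixDB's real classes:

```python
# Toy model of the flush protocol: writes are refused while the flush
# log record is in flight, the disk component is stamped with FLUSH_LSN
# (request time), and recovery starts at the smallest component LSN.
# Names here are illustrative only, not AsterixDB's actual API.

READABLE_WRITABLE = "READABLE_WRITABLE"
READABLE_UNWRITABLE = "READABLE_UNWRITABLE"

class Log:
    def __init__(self):
        self.next_lsn = 0
        self.records = []

    def append(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.records.append((lsn, record))
        return lsn

class Index:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.state = READABLE_WRITABLE
        self.mutable = []          # current in-memory component
        self.disk_components = []  # list of (flush_lsn, data)

    def write(self, op):
        # While we wait for the flush log record to become durable, the
        # component is READABLE_UNWRITABLE, so no write with an LSN
        # greater than FLUSH_LSN can sneak into the flushing component.
        if self.state == READABLE_UNWRITABLE:
            raise RuntimeError("component is READABLE_UNWRITABLE")
        self.mutable.append(op)
        return self.log.append((self.name, op))

    def schedule_flush(self):
        self.state = READABLE_UNWRITABLE
        flush_lsn = self.log.append((self.name, "FLUSH"))
        # Once the flush record is durable, swap in the shadow buffer so
        # incoming writes (all with LSN > flush_lsn) are not blocked.
        frozen, self.mutable = self.mutable, []
        self.state = READABLE_WRITABLE
        return flush_lsn, frozen

    def complete_flush(self, flush_lsn, frozen):
        # The disk component is stamped with FLUSH_LSN (request time),
        # not with the LSN that is current when the flush finishes.
        self.disk_components.append((flush_lsn, frozen))

log = Log()
a, b = Index("a", log), Index("b", log)
a.write("x1")
b.write("y1")
fa = a.schedule_flush()  # a's FLUSH_LSN is smaller ...
fb = b.schedule_flush()
b.complete_flush(*fb)    # ... yet b's flush completes first, so component
a.complete_flush(*fa)    # LSNs are not in flush-completion order.

# Recovery starts at the smallest component LSN, so every log record
# written after any component's FLUSH_LSN is still scanned.
start = min(lsn for idx in (a, b) for lsn, _ in idx.disk_components)
assert start == fa[0]
```

The out-of-order `complete_flush` calls mirror the observation in the quoted mail below: component LSNs need not be monotonic in completion order, yet starting recovery at the minimum component LSN never skips a log record.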
When the flush log record is persisted to disk, we can start the flush operation for the component, but we also switch to its shadow buffer and make that the current mutable component so that the ongoing flush does not block incoming writes. All new incoming writes, with LSN > FLUSH_LSN, go into the second buffer rather than the one being flushed.

When the flush operation is complete, we write the LSN corresponding to when the flush log record was created (I'm not sure how this information is pushed down to LSMBTreeIOOperationCallback). Regardless of when a set of index flushes hits the disk, the LSN recorded in a disk component should be that of the last operation that modified it, not the LSN at which the flush completed. Between FLUSH_LSN and the LSN that is current when the component actually reaches disk, no operations are performed on the flushed component.

If we instead set the LSN to the point when the component finishes flushing to disk, let's call it FLUSH_COMPLETE_LSN, recovery would miss the transaction log records generated between FLUSH_LSN and FLUSH_COMPLETE_LSN, because it would start scanning at FLUSH_COMPLETE_LSN instead of FLUSH_LSN. So even though a set of flush operations may finish writing to disk in an order different from the order in which they were requested, the timeline of what went into the disk components is consistent with the transaction log, and we do not risk losing data between FLUSH_LSN and FLUSH_COMPLETE_LSN in case of failures.

I don't know if there are any other uses for the LSN in the disk component, beyond finding out where recovery needs to start, that would break this assumption. Hope that helps!

On Fri, May 19, 2017 at 10:29 PM, Chen Luo <[email protected]> wrote:
> Hi Devs,
>
> Recently I was using LSN to set a component ID for every newly flushed
> disk component. From LSMBTreeIOOperationCallback (as well as other index
> callbacks), I saw that after every flush operation, the LSN was placed in
> the newly flushed disk component's metadata. I was expecting the LSN to
> be increasing for every newly flushed disk component. That is, if a disk
> component d1 is flushed later than another disk component d2, we should
> have d1.LSN > d2.LSN. (Please correct me if I'm wrong.)
>
> However, based on my testing, this condition does not always hold. It is
> possible for a later-flushed disk component to have a smaller LSN than a
> previously flushed one (I found this by recording the previous LSN and
> throwing an exception when it is larger than the current LSN). Is this
> behavior expected? Or do we not have the guarantee that LSNs placed in
> flushed disk components are monotonically increasing?
>
> Any help is appreciated.
>
> Best regards,
> Chen Luo
