Author: frm
Date: Mon Nov 28 10:37:08 2016
New Revision: 1771702

URL: http://svn.apache.org/viewvc?rev=1771702&view=rev
Log:
OAK-5167 - Document garbage collection

Add a general description of garbage collection. Describe the generational
garbage collection algorithm. Enumerate the three phases of garbage collection.

Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/nodestore/segment/overview.md

Modified: 
jackrabbit/oak/trunk/oak-doc/src/site/markdown/nodestore/segment/overview.md
URL: 
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/nodestore/segment/overview.md?rev=1771702&r1=1771701&r2=1771702&view=diff
==============================================================================
--- 
jackrabbit/oak/trunk/oak-doc/src/site/markdown/nodestore/segment/overview.md 
(original)
+++ 
jackrabbit/oak/trunk/oak-doc/src/site/markdown/nodestore/segment/overview.md 
Mon Nov 28 10:37:08 2016
@@ -15,9 +15,62 @@
   limitations under the License.
 -->
 
-# Segment Node Store
+# Oak Segment Tar
 
-The Segment Node Store is an implementation of the Node Store that persists 
repository data on the file system.
+Oak Segment Tar is an implementation of the Node Store that stores repository 
data on the file system.
+
+* [Garbage Collection](#garbage-collection)
+    * [Generational Garbage Collection](#generational-garbage-collection)
+    * [Estimation, Compaction and Cleanup](#estimation-compaction-cleanup)
+* [Design](#design)
+
+## <a name="garbage-collection"/> Garbage Collection
+
+Garbage Collection is the set of processes and techniques employed by Oak 
Segment Tar to eliminate unused persisted data, thus limiting the memory and 
disk footprint of the system.
+Most of the operations on repository data generate a certain amount of garbage.
+This garbage is a byproduct of the repository operations and consists of 
leftover data that is not usable by the user.
+If left unchecked, this garbage would just pile up, consume disk space and 
pollute in-memory data structures.
+To avoid this, Oak Segment Tar defines garbage collection procedures to 
eliminate unnecessary data.
+
+### <a name="generational-garbage-collection"/> Generational Garbage Collection
+
+The process implemented by Oak Segment Tar to eliminate unnecessary data is a 
generational garbage collection algorithm.
+The idea behind this algorithm is that the system assigns a generation to 
every piece of data generated by the user.
+A generation is just a number that is monotonically increasing.
+
+When the system first starts, every piece of data created by the user belongs 
to the first generation.
+When garbage collection runs, a second generation is started.
+As soon as the second generation is in place, data from the first generation 
that is still used by the user is copied over to the second generation.
+From this moment on, new data will be assigned to the second generation.
+Now the system contains data from the first and the second generation, but 
only data from the second generation is used.
+Garbage collection can now remove every piece of data from the first 
generation.
+This removal is safe, because every piece of data that is still in use was 
copied to the second generation when garbage collection started.
+
+The process of creating a new generation, migrating data to the new generation 
and removing an old generation is usually referred to as a "garbage collection 
cycle".
+The system goes through many garbage collection cycles over its lifetime, 
where every cycle removes unused data from older generations.
+
+### <a name="estimation-compaction-cleanup"/> Estimation, Compaction and 
Cleanup
+
+While the previous section describes the idea behind garbage collection, this 
section introduces the building blocks on top of which garbage collection is 
implemented.
+Oak Segment Tar splits the garbage collection process into three phases: 
estimation, compaction and cleanup.
+
+Estimation is the first phase of garbage collection.
+In this phase, the system checks how much garbage is actually present in the 
system.
+If there is not enough garbage to justify the creation of a new generation, 
this phase is responsible for stopping the rest of the garbage collection 
process.
+If the output of this phase reports that the amount of garbage is beyond a 
certain threshold, the system creates a new generation and goes on with the 
next phase.
+
+Compaction executes after a new generation is created.
+The purpose of compaction is to identify data that is currently used by the 
user.
+Once the system has a clear picture of which pieces of data the user is 
currently using, everything is copied to the new generation.
+This phase can be very time-consuming, depending on the size of the 
repository.
+The bigger the repository, the more has to be copied to the new generation.
+
+Cleanup is the last phase of garbage collection and kicks in as soon as 
compaction is done.
+Once relevant data is safe in the new generation, old and unused data from a 
previous generation can be removed.
+This phase locates outdated pieces of data from one of the oldest generations 
and removes them from the system.
+This is the only phase where data is actually deleted and disk space is 
finally freed.
+
+## <a name="design"/> Design
 
 The Segment Node Store serializes repository data and stores it in a set of 
TAR files.
 These files are the most coarse-grained containers for the repository data.
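
The garbage collection cycle described in the diff above (estimation, compaction, cleanup) can be illustrated with a small Python sketch. It is not part of this commit and does not use any Oak API: `ToyStore`, `Record`, the in-memory record list and the 25% threshold are hypothetical names and values chosen for illustration. The sketch assumes a single-threaded store and, as a simplification, removes every generation older than the current one during cleanup.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A piece of stored data tagged with the generation that wrote it."""
    key: str
    generation: int


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for a segment store (not an Oak API)."""
    generation: int = 1
    records: list = field(default_factory=list)
    live: set = field(default_factory=set)  # keys still used by the user

    def write(self, key):
        # New data always belongs to the current generation.
        self.records.append(Record(key, self.generation))
        self.live.add(key)

    def delete(self, key):
        # The user stops using the data; the record itself stays behind as garbage.
        self.live.discard(key)

    def estimate(self):
        # Estimation: fraction of stored records that are no longer in use.
        if not self.records:
            return 0.0
        garbage = sum(1 for r in self.records if r.key not in self.live)
        return garbage / len(self.records)

    def compact(self):
        # Compaction: start a new generation and copy every live key into it.
        self.generation += 1
        for key in sorted(self.live):
            self.records.append(Record(key, self.generation))

    def cleanup(self):
        # Cleanup: the only step that deletes data. Drops older generations
        # and returns the number of records removed.
        before = len(self.records)
        self.records = [r for r in self.records
                        if r.generation == self.generation]
        return before - len(self.records)

    def collect(self, threshold=0.25):
        # One garbage collection cycle; the threshold value is an assumption.
        if self.estimate() < threshold:
            return 0  # not enough garbage: skip compaction and cleanup
        self.compact()
        return self.cleanup()


store = ToyStore()
for key in ["a", "b", "c", "d"]:
    store.write(key)
store.delete("a")
store.delete("b")
removed = store.collect()
```

In this sketch the removal in `cleanup` is safe for the reason the documentation gives: every key still in `live` was copied to the new generation by `compact` before anything was deleted.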

