[
https://issues.apache.org/jira/browse/OAK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941849#comment-13941849
]
Chetan Mehrotra edited comment on OAK-1341 at 3/31/14 5:24 AM:
---------------------------------------------------------------
Based on past discussions with [~mreutegg] and [~tmueller], the following areas
need to be accounted for by the GC logic.
*Garbage Types*
Currently, revision garbage is created in the following areas:
# Deleted documents - If a document is deleted, it is currently not removed from
the persistent layer. OAK-1557 helps here.
# Split Documents - Documents are split as they grow in size. A split
document can be of the following types. Further, all revision entries in a split
doc are older than the revision of the split doc.
## SD1 - Document contains commit entries in {{_revision}} and all other
property history.
## SD2 - Document only contains entries for various properties that have been
updated over time
## SD3 - Document contains both {{_revision}} and property entries but had no
child when it was split
## SD4 - Document is an intermediate document created as part of cascading
split doc support (OAK-1342)
# Primary Document old revision - Even if a document is not split, it might
still contain old revision entries for properties and commits.
Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their
revisions are older.
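The removal rule above can be sketched as follows. This is only a minimal illustration, not Oak's implementation: the enum, class, and method names are hypothetical, and reading #1.2, #1.3, #1.4 as the split types SD2, SD3, SD4 is an assumption about the numbering.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical labels for the garbage types listed above; not Oak API. */
enum GarbageType {
    DELETED_DOC,           // #1: deleted document still present in the persistent layer
    SPLIT_SD1,             // commit entries in _revision plus all other property history
    SPLIT_SD2,             // property entries only
    SPLIT_SD3,             // _revision and property entries, no child at split time
    SPLIT_SD4,             // intermediate document from cascading splits
    PRIMARY_OLD_REVISIONS  // old entries inside a document that was never split
}

final class GarbageRules {
    // Assumption: these are the types the comment calls safe to remove completely.
    private static final Set<GarbageType> WHOLE_DOC_CANDIDATES = EnumSet.of(
            GarbageType.DELETED_DOC,
            GarbageType.SPLIT_SD2,
            GarbageType.SPLIT_SD3,
            GarbageType.SPLIT_SD4);

    /**
     * Returns true if the whole document may be removed: its type is a
     * candidate and its newest revision is older than the GC cutoff.
     */
    static boolean canRemoveCompletely(GarbageType type,
                                       long newestRevisionMillis,
                                       long cutoffMillis) {
        return WHOLE_DOC_CANDIDATES.contains(type)
                && newestRevisionMillis < cutoffMillis;
    }
}
```

For example, {{GarbageRules.canRemoveCompletely(GarbageType.SPLIT_SD2, 100L, 200L)}} is true in this sketch, while an SD1 document is never selected because it carries commit entries.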
*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish
between a failed/unmerged commit and an old commit.
Further, the GC logic also has to honour any checkpoints registered with the
NodeStore (OAK-1586).
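One way to honour checkpoints is to never let the GC cutoff move past the oldest registered checkpoint. The sketch below assumes checkpoints can be reduced to millisecond timestamps; the class and method names are hypothetical and do not reflect the NodeStore API.

```java
import java.util.Collection;

final class GcCutoff {
    /**
     * Computes the timestamp below which revisions may be collected.
     * Starts from (now - maxAge) and lowers it to the oldest checkpoint,
     * so nothing a checkpoint may still need is treated as garbage.
     */
    static long effectiveCutoff(long nowMillis,
                                long maxAgeMillis,
                                Collection<Long> checkpointMillis) {
        long cutoff = nowMillis - maxAgeMillis;
        for (long checkpoint : checkpointMillis) {
            cutoff = Math.min(cutoff, checkpoint);
        }
        return cutoff;
    }
}
```

With no checkpoints registered, the cutoff is simply the current time minus the maximum age.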
was (Author: chetanm):
Based on past discussions with [~mreutegg] and [~tmueller], the following areas
need to be accounted for by the GC logic.
*Garbage Types*
Currently, revision garbage is created in the following areas:
# Deleted documents - If a document is deleted, it is currently not removed from
the persistent layer. OAK-1557 helps here.
# Split Documents - Documents are split as they grow in size. A split
document can be of the following types. Further, all revision entries in a split
doc are older than the revision of the split doc.
## SD1 - Document contains commit entries in {{_revision}} and all other
property history.
## SD2 - Document only contains entries for various properties that have been
updated over time
## SD3 - Document contains both {{_revision}} and property entries but had no
child when it was split
## SD4 - Document is an intermediate document created as part of cascading
split doc support (OAK-1342)
# Primary Document old revision - Even if a document is not split, it might
still contain old revision entries for properties and commits.
Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their
revisions are older.
*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish
between a failed/unmerged commit and an old commit.
Further, the GC logic also has to honour any checkpoints registered with the
NodeStore (OAK-1586).
So for now, the aim would be #1 and #1.2.
> DocumentNodeStore: Implement revision garbage collection
> --------------------------------------------------------
>
> Key: OAK-1341
> URL: https://issues.apache.org/jira/browse/OAK-1341
> Project: Jackrabbit Oak
> Issue Type: Sub-task
> Components: mongomk
> Reporter: Thomas Mueller
> Assignee: Chetan Mehrotra
> Priority: Minor
> Fix For: 0.20
>
>
> For the MongoMK (as well as for other storage engines that are based on it),
> garbage collection is most easily implemented by iterating over all documents
> and removing unused entries (either whole documents, or data within the
> document).
> Iteration can be done in parallel (for example one process per shard), and it
> can be done in any order.
> The most efficient order is probably the id order; however, it might be
> better to iterate only over documents that were not changed recently, by
> using the index on the "_modified" property. That way we don't need to
> iterate over the whole repository over and over again, but just over those
> documents that were actually changed.
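
The {{_modified}}-based iteration described in the issue could look roughly like the sketch below. It is an in-memory illustration under stated assumptions: a sorted map stands in for the index on {{_modified}}, the class and method names are hypothetical, and the window "changed since the last GC run but not more recently than the cutoff" is one possible reading of the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

final class ModifiedIndexScan {
    /**
     * Returns the ids of documents whose _modified value lies in
     * (lastGcMillis, cutoffMillis], in ascending _modified order.
     * Documents untouched since the last run are skipped entirely.
     */
    static List<String> candidates(NavigableMap<Long, List<String>> modifiedIndex,
                                   long lastGcMillis,
                                   long cutoffMillis) {
        List<String> ids = new ArrayList<>();
        if (lastGcMillis >= cutoffMillis) {
            return ids; // empty window, nothing to scan
        }
        // Range scan over the stand-in index: exclusive lower, inclusive upper bound.
        for (List<String> bucket
                : modifiedIndex.subMap(lastGcMillis, false, cutoffMillis, true).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }
}
```

Each shard or process could run such a scan over its own key range, since the description notes that iteration order and parallelism are free choices.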