[ 
https://issues.apache.org/jira/browse/OAK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941849#comment-13941849
 ] 

Chetan Mehrotra edited comment on OAK-1341 at 3/31/14 5:24 AM:
---------------------------------------------------------------

Based on past discussion with [~mreutegg] and [~tmueller], the following areas need 
to be accounted for by the GC logic

*Garbage Types*
Currently revision garbage gets created in the following areas

# Deleted documents - If a document is deleted, it is currently not removed from 
the persistent layer. OAK-1557 helps here.
# Split documents - Documents are split as they grow in size. A split 
document can be of the following types. Further, all revision entries in the split 
doc are older than the revision of the split doc.
## SD1 - Document contains commit entries in {{_revision}} and all other 
property history.
## SD2 - Document contains only entries for various properties which have been 
updated over time.
## SD3 - Document contains both {{_revision}} and property entries but had no 
child when it was split.
## SD4 - Document is an intermediate document created as part of cascading 
split doc support (OAK-1342).
# Primary document old revisions - Even if a document is not split, it might 
still contain old revision entries for properties and commits.

Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their 
revisions are older.
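
The garbage types listed above can be modelled as a small classification. This is only a sketch: {{DocInfo}} and its flags are hypothetical names, and the rule used to tell SD1 from SD3 (whether the document had children when it was split) is one reading of the list, not the actual Oak implementation.

```python
from dataclasses import dataclass
from enum import Enum


class GarbageType(Enum):
    DELETED_DOCUMENT = "deleted document"
    SPLIT_SD1 = "split doc with commit entries and property history"
    SPLIT_SD2 = "split doc with property entries only"
    SPLIT_SD3 = "split doc that had no child when it was split"
    SPLIT_SD4 = "intermediate split doc"
    PRIMARY_OLD_REVISIONS = "old revision entries in a primary document"


@dataclass
class DocInfo:
    # Hypothetical flags; a real store would derive them from the document.
    deleted: bool = False
    is_split: bool = False
    intermediate: bool = False
    has_commit_entries: bool = False
    had_children: bool = False


def classify(doc: DocInfo) -> GarbageType:
    """Map a document summary to the garbage type it may contribute."""
    if doc.deleted:
        return GarbageType.DELETED_DOCUMENT
    if not doc.is_split:
        # A primary document *might* hold old entries; this only names the bucket.
        return GarbageType.PRIMARY_OLD_REVISIONS
    if doc.intermediate:
        return GarbageType.SPLIT_SD4
    if not doc.has_commit_entries:
        return GarbageType.SPLIT_SD2
    if not doc.had_children:
        return GarbageType.SPLIT_SD3
    return GarbageType.SPLIT_SD1
```

Which of these buckets the GC may drop is a separate policy decision and is deliberately left out of the sketch.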

*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish 
between a failed/unmerged commit and an old commit.

Further, the GC logic also has to honour any checkpoints registered with the 
NodeStore (OAK-1586).
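
One possible way to honour checkpoints is to never let the GC cutoff move past the oldest registered checkpoint. The sketch below assumes that each checkpoint can be reduced to a timestamp in milliseconds and that only revisions older than the cutoff are GC candidates; {{gc_cutoff_millis}} and its parameters are hypothetical names, not part of the NodeStore API.

```python
import time
from typing import Iterable, Optional


def gc_cutoff_millis(max_age_millis: int,
                     checkpoint_millis: Iterable[int],
                     now_millis: Optional[int] = None) -> int:
    """Return the timestamp below which revisions may be considered garbage.

    The cutoff is `now - max_age`, pulled back to the oldest checkpoint
    if a checkpoint is older than that.
    """
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    cutoff = now_millis - max_age_millis
    oldest = min(checkpoint_millis, default=None)
    if oldest is not None:
        cutoff = min(cutoff, oldest)
    return cutoff
```

With no checkpoints registered, the cutoff is determined by the maximum age alone.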


was (Author: chetanm):
Based on past discussion with [~mreutegg] and [~tmueller], the following areas need 
to be accounted for by the GC logic

*Garbage Types*
Currently revision garbage gets created in the following areas

# Deleted documents - If a document is deleted, it is currently not removed from 
the persistent layer. OAK-1557 helps here.
# Split documents - Documents are split as they grow in size. A split 
document can be of the following types. Further, all revision entries in the split 
doc are older than the revision of the split doc.
## SD1 - Document contains commit entries in {{_revision}} and all other 
property history.
## SD2 - Document contains only entries for various properties which have been 
updated over time.
## SD3 - Document contains both {{_revision}} and property entries but had no 
child when it was split.
## SD4 - Document is an intermediate document created as part of cascading 
split doc support (OAK-1342).
# Primary document old revisions - Even if a document is not split, it might 
still contain old revision entries for properties and commits.

Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their 
revisions are older.

*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish 
between a failed/unmerged commit and an old commit.

Further, the GC logic also has to honour any checkpoints registered with the 
NodeStore (OAK-1586).

So for now we would aim for #1 and #1.2.

> DocumentNodeStore: Implement revision garbage collection
> --------------------------------------------------------
>
>                 Key: OAK-1341
>                 URL: https://issues.apache.org/jira/browse/OAK-1341
>             Project: Jackrabbit Oak
>          Issue Type: Sub-task
>          Components: mongomk
>            Reporter: Thomas Mueller
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>             Fix For: 0.20
>
>
> For the MongoMK (as well as for other storage engines that are based on it), 
> garbage collection is most easily implemented by iterating over all documents 
> and removing unused entries (either whole documents, or data within the 
> document). 
> Iteration can be done in parallel (for example one process per shard), and it 
> can be done in any order. 
> The most efficient order is probably the id order; however, it might be 
> better to iterate only over documents that were not changed recently, by 
> using the index on the "_modified" property. That way we don't need to 
> iterate over the whole repository over and over again, but just over those 
> documents that were actually changed.
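
The iteration described in the issue can be sketched roughly as follows. This is a toy in-memory model, not MongoMK code: {{InMemoryStore}}, {{ids_modified_between}}, {{collect}} and the boolean {{deleted}} flag are hypothetical, and only the {{_modified}} property comes from the description. It shows only the removal of whole documents; pruning entries inside a document is not covered.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


class InMemoryStore:
    """Toy stand-in for a document store with an index on `_modified`."""

    def __init__(self, docs: List[dict]) -> None:
        self.docs: Dict[str, dict] = {d["_id"]: d for d in docs}

    def ids_modified_between(self, start: int, end: int) -> List[str]:
        """Return ids with start < _modified <= end, in `_modified` order."""
        # Rebuilding the sorted list on each call imitates an index lookup.
        index: List[Tuple[int, str]] = sorted(
            (d["_modified"], d["_id"]) for d in self.docs.values()
        )
        keys = [modified for modified, _ in index]
        lo = bisect_right(keys, start)
        hi = bisect_right(keys, end)
        return [doc_id for _, doc_id in index[lo:hi]]


def collect(store: InMemoryStore, last_run: int, cutoff: int) -> List[str]:
    """Remove documents flagged as deleted that changed after the previous
    GC run but not more recently than `cutoff`; return the removed ids."""
    removed: List[str] = []
    for doc_id in store.ids_modified_between(last_run, cutoff):
        if store.docs[doc_id].get("deleted"):
            del store.docs[doc_id]
            removed.append(doc_id)
    return removed
```

Because each run only scans the window between the previous run and the cutoff, documents that did not change are not visited again. Separate windows (or shards) could be processed by separate workers, since the order of iteration does not matter.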



--
This message was sent by Atlassian JIRA
(v6.2#6252)
