[jira] [Comment Edited] (OAK-1849) DataStore GC support for heterogeneous deployments using a shared datastore

Thomas Mueller (JIRA) Mon, 07 Jul 2014 02:36:14 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018616#comment-14018616
 ]


Thomas Mueller edited comment on OAK-1849 at 7/7/14 9:34 AM:
-------------------------------------------------------------

What you describe above is the solution we used for sharing data stores in 
Jackrabbit 2.x. For the FileDataStore, we used the lastModified field of each 
file. For large data stores, updating that field takes quite a long time, as 
the metadata of each file needs to be changed. In the past, this turned out to 
be a performance problem.

To speed up garbage collection, I suggest we use a slightly different mechanism 
(except in cases where we share a datastore with a Jackrabbit 2.x repository):

# We use {{collectGarbage(boolean markOnly)}} - same as what you described 
above. If the flag is {{true}}, the list of used blob ids is written to a flat 
file in the root directory of the data store (using a random file name) during 
or at the end of the {{mark}} phase.
# If {{markOnly}} is {{false}}, the {{sweep()}} method needs to additionally 
check the root directory of the data store, and process all flat files stored 
there, combining the lists if there are multiple. Blobs whose ids appear in 
the list(s) must not be deleted. At the end of the sweep phase, the processed 
files may be removed.
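
To make the two steps concrete, here is a minimal sketch, not actual Oak code. 
It assumes that blobs are plain files named by their blob id in a single 
directory, and that mark files sit in the data store root under a hypothetical 
{{gc-mark-<random>.txt}} name. The class, method, and file names are all 
invented for illustration. It does not deal with blobs added after a mark 
file was written, or with checking that every repository has completed its 
mark phase before the sweep starts.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

class SharedDataStoreGcSketch {

    // Hypothetical naming convention for the flat files in the data store root.
    static final String MARK_PREFIX = "gc-mark-";
    static final String MARK_SUFFIX = ".txt";

    /** Mark phase: write the blob ids in use to a randomly named flat file. */
    static Path mark(Path dataStoreRoot, Set<String> usedBlobIds) throws IOException {
        Path markFile = dataStoreRoot.resolve(MARK_PREFIX + UUID.randomUUID() + MARK_SUFFIX);
        Files.write(markFile, usedBlobIds, StandardCharsets.UTF_8); // one id per line
        return markFile;
    }

    /**
     * Sweep phase: combine all mark files, delete every blob whose id is in
     * none of them, then remove the processed mark files.
     * Returns the ids of the deleted blobs.
     */
    static List<String> sweep(Path dataStoreRoot, Path blobDir) throws IOException {
        Set<String> used = new HashSet<>();
        List<Path> processed = new ArrayList<>();
        String glob = MARK_PREFIX + "*" + MARK_SUFFIX;
        try (DirectoryStream<Path> marks = Files.newDirectoryStream(dataStoreRoot, glob)) {
            for (Path markFile : marks) {
                used.addAll(Files.readAllLines(markFile, StandardCharsets.UTF_8));
                processed.add(markFile);
            }
        }

        // Collect candidates first, so nothing is deleted while the directory is listed.
        List<Path> garbage = new ArrayList<>();
        try (DirectoryStream<Path> blobs = Files.newDirectoryStream(blobDir)) {
            for (Path blob : blobs) {
                if (!used.contains(blob.getFileName().toString())) {
                    garbage.add(blob);
                }
            }
        }

        List<String> deleted = new ArrayList<>();
        for (Path blob : garbage) {
            Files.delete(blob);
            deleted.add(blob.getFileName().toString());
        }
        for (Path markFile : processed) {
            Files.delete(markFile);
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("datastore");
        Path blobs = Files.createDirectory(root.resolve("blobs"));
        for (String id : Arrays.asList("a", "b", "c")) {
            Files.write(blobs.resolve(id), new byte[] {1});
        }
        // Two repositories sharing the data store each run a mark-only pass.
        mark(root, new HashSet<>(Arrays.asList("a")));
        mark(root, new HashSet<>(Arrays.asList("b")));
        // One of them then sweeps, using the combined lists.
        System.out.println("deleted: " + sweep(root, blobs));
    }
}
```

In this sketch each repository only ever appends one new file to the data 
store root, so the mark phase does not touch the metadata of individual blobs.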



was (Author: tmueller):
What you describe above is the solution we had for Jackrabbit 2.x data stores, 
to share data stores. For the FileDataStore, we used the lastModified field of 
the file. For large data stores, updating the field takes quite a long time, as 
the metadata of each file needs to be changed. In the past, this turned out to 
be a performance problem.

To speed up garbage collection, I suggest we use a slightly different mechanism 
(unless for cases where we share a datastore with a Jackrabbit 2.x repository):

# We use {{collectGarbage(boolean markOnly)}} - same as what you described 
above. If the flag is {{true}}, the list of used blob ids are written to a flat 
file in the root directory of the data store (using a random file name) during 
or at the end of the {{mark}} phase.
# If {{markOnly}} if {{false}}, the {{sweep()}} method needs to additionally 
check the root directory of the data store, and process all flat files stored 
there, combining the lists if there are multiple. Entries in the list(s) must 
not be deleted. At the end of the sweep phase, the processed files may be 
removed.


> DataStore GC support for heterogeneous deployments using a shared datastore
> ---------------------------------------------------------------------------
>
>                 Key: OAK-1849
>                 URL: https://issues.apache.org/jira/browse/OAK-1849
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>            Reporter: Amit Jain
>
> If the deployment is such that there are 2 or more different instances with a 
> shared datastore, triggering Datastore GC from one instance will result in 
> blobs used by another instance getting deleted, causing data loss.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
