[
https://issues.apache.org/jira/browse/OAK-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549960#comment-14549960
]
Thomas Mueller commented on OAK-2882:
-------------------------------------
As for the "//TODO For now using an in memory map. For very large repositories
this might consume lots of memory.", I have an idea how to solve that: we could
use a minimal perfect hash table, and only store the length and a few bytes of
each key. This would require only about 8 bytes per entry, so about 8 MB of
heap memory per 1 million entries in the data store. The minimal perfect hash
table needs 2 bits per key; then let's say 4 bytes per key for a fingerprint of
the identifier (to detect, with very high probability, if there is a bug or a
missing entry), and 4 bytes for the length. If the length does not fit in 4
bytes (very, very rare), we store -1, which means a file lookup is needed for
those entries. I have [an
implementation|https://github.com/h2database/h2database/blob/master/h2/src/tools/org/h2/dev/hash/MinimalPerfectHash.java]
of the minimal perfect hash table. Instead of using a hash table, we could
also just keep a simple array, sorted by fingerprint.
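The "simple array, sorted by fingerprint" variant could be sketched roughly as
below. This is an illustrative sketch, not Oak code: the class name
{{BlobLengthIndex}} is hypothetical, and {{String.hashCode}} stands in for a
stronger fingerprint function. Each entry packs a 4-byte fingerprint and a
4-byte length into one {{long}} (8 bytes per entry), with -1 as the "length
too large, do a file lookup" marker described above:

```java
import java.util.Arrays;

// Hypothetical sketch: one long per entry, high 32 bits = fingerprint of
// the blob identifier, low 32 bits = length (-1 if it does not fit).
public class BlobLengthIndex {

    private final long[] entries; // sorted by fingerprint

    public BlobLengthIndex(String[] ids, long[] lengths) {
        entries = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            // lengths that do not fit in a signed 32-bit int get the -1 marker
            int len = lengths[i] <= Integer.MAX_VALUE ? (int) lengths[i] : -1;
            entries[i] = ((long) fingerprint(ids[i]) << 32) | (len & 0xFFFFFFFFL);
        }
        Arrays.sort(entries);
    }

    // 4-byte fingerprint; String.hashCode stands in for a stronger hash
    static int fingerprint(String id) {
        return id.hashCode();
    }

    // returns the stored length, -1 if a file lookup is needed, or -2 if the
    // id is (with very high probability) not in the data store at all;
    // fingerprint collisions would return the first matching entry
    public long getLength(String id) {
        long fp = fingerprint(id) & 0xFFFFFFFFL;
        int pos = Arrays.binarySearch(entries, fp << 32);
        if (pos < 0) {
            pos = -pos - 1; // insertion point: first entry with this fingerprint
        }
        if (pos >= entries.length || (entries[pos] >>> 32) != fp) {
            return -2;
        }
        return (int) entries[pos];
    }
}
```

With a minimal perfect hash in front, the binary search would be replaced by a
direct array index, at the cost of the extra ~2 bits per key.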
> Support migration without access to DataStore
> ---------------------------------------------
>
> Key: OAK-2882
> URL: https://issues.apache.org/jira/browse/OAK-2882
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: upgrade
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Labels: docs-impacting
> Fix For: 1.3.0, 1.0.15
>
> Attachments: OAK-2882-v2.patch, OAK-2882.patch,
> build_datastore_list.sh
>
>
> Migration currently requires access to the DataStore, as it is configured as
> part of repository.xml. However, during migration the actual binary content
> in the DataStore is not accessed; the migration logic only makes use of
> * DataIdentifier = id of the files
> * Length = as it gets encoded as part of the blobId (OAK-1667)
> It would therefore be faster and beneficial to allow migration without actual
> access to the DataStore. This would serve two benefits
> # Allows one to test a migration on a local setup by just copying the TarPM
> files. For example, one could zip only the following files to get going with
> repository startup, if we can somehow avoid having direct access to the
> DataStore
> {noformat}
> crx-quickstart# tar -zcvf repo-2.tar.gz repository \
>   --exclude=repository/repository/datastore \
>   --exclude=repository/repository/index \
>   --exclude=repository/workspaces/crx.default/index \
>   --exclude=repository/tarJournal
> {noformat}
> # Provides faster (repeatable) migration, as access to the DataStore (which
> in cases like S3 might be slow) can be avoided, given we solve how to get the
> length
> *Proposal*
> Have a DataStore implementation which can be provided with a mapping file
> containing entries for blobId and length. This file would be used to answer
> queries regarding the length and existence of a blob, and thus would avoid
> actual access to the DataStore.
> Going further, this DataStore could be configured with a delegate to be used
> as a fallback in case the required details are not present in the precomputed
> data set (maybe due to a change in content after that data was computed)
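The mapping-file-plus-delegate proposal could look roughly like the sketch
below. This is only an illustration under assumed names ({{MappedLengthSource}}
is hypothetical and does not implement Oak's real DataStore API): length and
existence queries are answered from the precomputed map, and the delegate is
consulted only on a miss.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the proposed mapping-file DataStore: lengths come
// from a precomputed blobId -> length map; a delegate (standing in for the
// real DataStore) is the fallback when an entry is missing, e.g. because
// content changed after the mapping was computed.
public class MappedLengthSource {

    private final Map<String, Long> lengths = new HashMap<>();
    private final Function<String, Long> delegate; // fallback, returns -1 if absent

    public MappedLengthSource(Map<String, Long> precomputed,
                              Function<String, Long> delegate) {
        lengths.putAll(precomputed);
        this.delegate = delegate;
    }

    // answers length queries without touching the real DataStore when possible
    public long getLength(String blobId) {
        Long len = lengths.get(blobId);
        if (len != null) {
            return len;
        }
        // not in the precomputed set: fall back to the delegate (slow path)
        return delegate.apply(blobId);
    }

    public boolean exists(String blobId) {
        return lengths.containsKey(blobId) || delegate.apply(blobId) >= 0;
    }
}
```

The mapping file itself (blobId and length per line) could be produced once,
e.g. by a script like build_datastore_list.sh attached to this issue, and then
reused across repeated migration runs.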
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)