[ 
https://issues.apache.org/jira/browse/OAK-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra updated OAK-2882:
---------------------------------
    Attachment: build_datastore_list.sh
                OAK-2882.patch

Attached a [patch|^OAK-2882.patch] which introduces a new 
{{DelegatingDataStore}} that can be configured as below

{code:xml}
<DataStore class="org.apache.jackrabbit.oak.upgrade.blob.DelegatingDataStore">
      <param name="recordSizeMappingFile" value="/path/to/mapping/file" />
      <param name="delegateClass" 
value="org.apache.jackrabbit.core.data.FileDataStore" />
</DataStore>
{code}

Where the mapping file looks like below
{noformat}
4432|dd10bca036f3134352c63e534d4568a3d2ac2fdc
32167|dd10bca036f3134567c63e534d4568a3d2ac2fdc
{noformat}

* For each record in the DataStore there is a line in the mapping file
* The first part denotes the length (in bytes)
* The second part denotes the blobId
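
For illustration, each mapping line can be split into its two fields with plain 
bash parameter expansion. This is just a sketch of the format described above; 
the variable names are mine, not from the patch:

{code}
#!/bin/bash
# Sketch: split one mapping line of the form <length>|<blobId>
mapping="4432|dd10bca036f3134352c63e534d4568a3d2ac2fdc"
length="${mapping%%|*}"   # everything before the first |
blobId="${mapping#*|}"    # everything after the first |
echo "length=$length blobId=$blobId"
{code}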

The mapping file can be generated, for say a FileDataStore, via the following 
[script|^build_datastore_list.sh] (thanks to Jay Kerger)
{code}
#!/bin/bash
#
# This is a script to create a datastore mapping file with one line per file
# using the format:
#   <size in bytes of resource>|<resource ID>
#
# This script should be run at the root of the datastore folder

list="./datastore-list.txt"
rm -f "$list"
# need to loop across the first directory level or the massive find command
# will take too long
time for dir in `ls -1 ./`; do
  echo "working on $dir"
  time for file in `find "$dir/" -type d`; do
    # list plain files only (skip the "total" line and directories)
    # and print <size>|<name>
    ls -l "$file/" | grep -v total | grep -v "^d" \
      | awk '{print $5"|"$9}' >> "$list"
  done
done
# remove blank lines caused by intermediate directories
grep "|" "$list" > "$list.tmp"
mv "$list.tmp" "$list"
{code}
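
After generating the file it can be worth sanity-checking it before feeding it 
to the migration. The check below is an assumed helper (not part of the 
attachment) and assumes 40-hex-character identifiers, as in the example above:

{code}
#!/bin/bash
# Sketch: flag mapping lines that do not match <decimal length>|<40-hex blobId>
list="./datastore-list.txt"
grep -Ev '^[0-9]+\|[0-9a-f]{40}$' "$list" \
  && echo "malformed lines found" \
  || echo "mapping file OK"
{code}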

*Benefits*
* Allows one to perform and test a migration without direct access to the 
DataStore. In most setups it is the DataStore that takes up most of the space. 
To test a migration one can compress just the basic required files, create the 
mapping file (using the script above), and then run the migration without 
direct access to the DataStore.
* Speeds up migration. If the DataStore is on a remote NFS mount or on S3, 
access to it might slow down the migration. Creating the mapping file once 
avoids that cost and hence allows one to retry the migration quickly in case 
some failure happens during migration.

[~tmueller] [~jsedding] Can you have a look?

> Support migration without access to DataStore
> ---------------------------------------------
>
>                 Key: OAK-2882
>                 URL: https://issues.apache.org/jira/browse/OAK-2882
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: upgrade
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.0, 1.0.15
>
>         Attachments: OAK-2882.patch, build_datastore_list.sh
>
>
> Migration currently involves access to the DataStore as it is configured as 
> part of repository.xml. However in a complete migration the actual binary 
> content in the DataStore is not accessed and the migration logic only makes 
> use of
> * DataIdentifier = id of the files
> * Length = as it gets encoded as part of the blobId (OAK-1667)
> It would be faster and beneficial to allow migration without actual access 
> to the DataStore. This would serve two benefits
> # Allows one to test out a migration on a local setup by just copying the 
> TarPM files. For e.g. one can zip only the following files to get going with 
> repository startup if we can somehow avoid having direct access to the 
> DataStore
> {noformat}
> >crx-quickstart# tar -zcvf repo-2.tar.gz repository 
> >--exclude=repository/repository/datastore 
> >--exclude=repository/repository/index 
> >--exclude=repository/workspaces/crx.default/index 
> >--exclude=repository/tarJournal
> {noformat}
> # Provides faster (repeatable) migration as access to the DataStore, which 
> in cases like S3 might be slow, can be avoided, given we solve how to get 
> the length
> *Proposal*
> Have a DataStore implementation which can be provided a mapping file having 
> entries for blobId and length. This file would be used to answer queries 
> regarding the length and existence of a blob and thus would avoid actual 
> access to the DataStore.
> Going further this DataStore can be configured with a delegate which can be 
> used as a fallback in case the required details are not present in the 
> pre-computed data set (maybe due to a change in content after that data was 
> computed)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
