[
https://issues.apache.org/jira/browse/JCR-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906853#comment-13906853
]
Shashank Gupta edited comment on JCR-3733 at 3/12/14 4:45 AM:
--------------------------------------------------------------
h2. Specification
h3. S3DataStore Asynchronous Upload to S3
The current logic for adding a file record to the S3DataStore is to add the
file to the local cache and then upload it to S3 in a single synchronous step.
This feature splits that logic into a synchronous add to the local cache
followed by an asynchronous upload of the file to S3. Until the asynchronous
upload completes, all data (input stream, length and lastModified) for that
file record is fetched from the local cache.
The AWS SDK provides [upload progress
listeners|http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/event/ProgressListener.html]
which provide callbacks on the status of an in-progress upload.
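The completion-callback pattern the SDK's progress listeners enable can be sketched with plain JDK classes as follows; `UploadListener` and `AsyncUploader` are illustrative stand-ins, not the actual SDK or S3DataStore types:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the SDK's ProgressListener callbacks.
interface UploadListener {
    void onComplete(String path);
    void onFailed(String path, Exception e);
}

class AsyncUploader {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Start an upload in the background and report the outcome through
    // the listener, the way the SDK's progress events would.
    void upload(String path, UploadListener listener) {
        pool.submit(() -> {
            try {
                // ... stream the file to S3 here ...
                listener.onComplete(path);
            } catch (Exception e) {
                listener.onFailed(path, e);
            }
        });
    }

    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In the real feature, the `onComplete` path is where the AsyncUploadCache entry is flushed.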
h3. Flag to turn it off
The parameter 'asyncUploadLimit' limits the number of concurrent asynchronous
uploads to S3. Once this limit is reached, the next upload to S3 is synchronous
until one of the asynchronous uploads completes. To disable this feature, set
the asyncUploadLimit parameter to 0 in repository.xml. The default is 100.
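The limit check can be sketched as a small gate; `UploadGate` and its fields are illustrative names, not actual S3DataStore members:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the asyncUploadLimit gate.
class UploadGate {
    private final int asyncUploadLimit;                    // 0 disables async uploads
    private final AtomicInteger inProgress = new AtomicInteger();

    UploadGate(int asyncUploadLimit) {
        this.asyncUploadLimit = asyncUploadLimit;
    }

    // Returns true if the caller may upload asynchronously; the slot is
    // released again from the upload-completed callback.
    boolean tryAcquireAsyncSlot() {
        if (asyncUploadLimit <= 0) return false;           // feature disabled
        while (true) {
            int n = inProgress.get();
            if (n >= asyncUploadLimit) return false;       // limit reached: go synchronous
            if (inProgress.compareAndSet(n, n + 1)) return true;
        }
    }

    void release() {
        inProgress.decrementAndGet();
    }
}
```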
h3. Caution
# This feature should not be used in a clustered active-active Jackrabbit
deployment, as a file may not be fully uploaded to S3 before it is accessed on
another node. In active-passive clustered mode, this feature requires manually
uploading incomplete asynchronous uploads to S3 after failover.
# When using this feature, it is strongly recommended NOT to delete any file
from the local cache manually, as the local cache may contain files whose
uploads to S3 have not completed.
h3. Asynchronous Upload Cache
S3DataStore keeps an AsyncUploadCache which holds in-progress asynchronous
uploads. This class contains two data structures: a \{@link Map<String,
Long>\} of file path to lastModified holding the in-progress asynchronous
uploads, and a \{@link Set<String>\} of in-progress uploads that were marked
for delete while the asynchronous upload was running. When an asynchronous
upload is initiated, an entry is added to this cache, and when the upload
completes, the corresponding entry is flushed. Any modification to this cache
is immediately serialized to the filesystem in a synchronized code block.
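The two data structures and their synchronized updates can be sketched as below; `AsyncUploadCacheSketch` is an illustrative model, not the actual class, and the real implementation serializes its state to a file where `persist()` is stubbed here:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the two data structures described above.
class AsyncUploadCacheSketch {
    private final Map<String, Long> inProgress = new HashMap<>(); // path -> lastModified
    private final Set<String> toBeDeleted = new HashSet<>();      // marked while upload runs

    synchronized void add(String path, long lastModified) {
        inProgress.put(path, lastModified);
        persist();
    }

    // Called from the upload-completed callback; returns true if the
    // entry was marked for delete while the upload was running.
    synchronized boolean remove(String path) {
        inProgress.remove(path);
        boolean deleteIt = toBeDeleted.remove(path);
        persist();
        return deleteIt;
    }

    synchronized void markForDelete(String path) {
        if (inProgress.containsKey(path)) {
            toBeDeleted.add(path);
            persist();
        }
    }

    synchronized boolean isInProgress(String path) {
        return inProgress.containsKey(path);
    }

    private void persist() {
        // The real implementation serializes the cache state to the
        // filesystem here, inside the synchronized block.
    }
}
```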
h3. Semantics of various DataStore and DataRecord APIs w.r.t. AsyncUploadCache
Before this feature, S3 was the single source of truth. For example,
DataStore#getRecordIfStored(DataIdentifier) returns a DataRecord if the
dataIdentifier exists in S3 and null otherwise; whether the dataIdentifier
exists in the local cache does not matter. With this feature, S3 remains the
source of truth for completed uploads, while AsyncUploadCache is the source of
truth for in-progress asynchronous uploads.
h4. DataRecord DataStore#addRecord(InputStream)
Checks whether an asynchronous upload can be started for the input stream,
based on asyncUploadLimit and the current local cache size. If the local cache
advises to proceed with an asynchronous upload, this method adds an entry to
AsyncUploadCache and starts the asynchronous upload; otherwise it proceeds
with a synchronous upload to S3. If an asynchronous upload is already in
progress for that dataIdentifier, it just updates lastModified in
AsyncUploadCache. Once the asynchronous upload completes, the callback removes
the entry from AsyncUploadCache.
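The decision flow above can be sketched with S3 and the caches modelled as in-memory maps; `AddRecordSketch` and its members are illustrative, not the real S3DataStore API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative control flow of addRecord: asynchronous when a slot is
// free, synchronous otherwise; maps stand in for the cache and bucket.
class AddRecordSketch {
    final Map<String, Long> asyncCache = new HashMap<>(); // in-progress uploads
    final Map<String, byte[]> s3 = new HashMap<>();       // stand-in for the bucket
    final int asyncUploadLimit;
    final ExecutorService pool = Executors.newCachedThreadPool();

    AddRecordSketch(int asyncUploadLimit) {
        this.asyncUploadLimit = asyncUploadLimit;
    }

    synchronized void addRecord(String id, byte[] data, long now) {
        if (asyncCache.containsKey(id)) {        // upload already running:
            asyncCache.put(id, now);             // just touch lastModified
            return;
        }
        if (asyncUploadLimit > 0 && asyncCache.size() < asyncUploadLimit) {
            asyncCache.put(id, now);             // asynchronous path
            pool.submit(() -> {
                putToS3(id, data);
                synchronized (this) {            // upload-completed callback
                    asyncCache.remove(id);
                }
            });
        } else {
            putToS3(id, data);                   // synchronous path
        }
    }

    synchronized void putToS3(String id, byte[] data) {
        s3.put(id, data);
    }

    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```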
h4. DataRecord DataStore#getRecordIfStored(DataIdentifier)
Returns a DataRecord if an in-progress asynchronous upload exists in
AsyncUploadCache or a record exists in S3 for the dataIdentifier. If
minModified > 0, the timestamp is updated in AsyncUploadCache and S3.
h4. MultiDataStoreAware#deleteRecord(DataIdentifier)
For in-progress uploads, this method adds the identifier to the "toBeDeleted"
set in AsyncUploadCache. When the asynchronous upload completes and invokes
the callback, the callback checks whether the in-progress upload was marked
for delete and, if so, invokes deleteRecord to actually delete the record.
h4. DataStore#deleteAllOlderThan(long min)
Deletes all records older than min from S3. As AsyncUploadCache maintains a
map of in-progress asynchronous uploads to lastModified, it marks for delete
those asynchronous uploads whose lastModified < min. When an asynchronous
upload completes and invokes the callback, the callback checks whether the
in-progress upload was marked for delete and, if so, invokes deleteRecord to
actually delete the record.
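The deferred-delete sweep over the in-progress map can be sketched as follows; `DeleteOlderThanSketch` is an illustrative model, and the S3 side of the delete is stubbed:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the deleteAllOlderThan sweep: in-progress entries with
// lastModified < min are only marked; the actual delete happens later,
// in the upload-completed callback.
class DeleteOlderThanSketch {
    final Map<String, Long> inProgress = new HashMap<>(); // id -> lastModified
    final Set<String> toBeDeleted = new HashSet<>();

    // Returns identifiers of in-progress uploads newly marked for delete.
    synchronized Set<String> deleteAllOlderThan(long min) {
        Set<String> marked = new HashSet<>();
        for (Map.Entry<String, Long> e : inProgress.entrySet()) {
            if (e.getValue() < min) {
                toBeDeleted.add(e.getKey());
                marked.add(e.getKey());
            }
        }
        // ... delete completed records older than min from S3 here ...
        return marked;
    }
}
```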
h4. Iterator<DataIdentifier> DataStore#getAllIdentifiers()
Returns all identifiers in S3, plus the in-progress upload identifiers from
AsyncUploadCache, minus the identifiers in AsyncUploadCache's "toBeDeleted"
set.
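The union-minus view can be sketched as a set operation; `AllIdentifiersSketch` and the parameter names are illustrative:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// The identifier view described above: S3 keys, plus in-progress
// asynchronous uploads, minus entries already marked for delete.
class AllIdentifiersSketch {
    static Iterator<String> getAllIdentifiers(Set<String> s3Keys,
                                              Set<String> inProgress,
                                              Set<String> toBeDeleted) {
        Set<String> all = new HashSet<>(s3Keys);
        all.addAll(inProgress);      // plus in-progress uploads
        all.removeAll(toBeDeleted);  // minus pending deletes
        return all.iterator();
    }
}
```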
h4. long DataRecord#getLength()
If the file exists in the local cache, the length is retrieved from it.
Otherwise the length is retrieved from S3 and cached in the local cache.
h4. DataRecord#getLastModified()
If the record is an in-progress upload, lastModified is retrieved from
AsyncUploadCache; otherwise it is retrieved from S3 and cached in the local
cache.
h3. Behavior of Local cache Purge
The local cache has a size limit. When the current size of the cache exceeds
the limit, the cache enters auto-purge mode to clean older entries and reclaim
space. During purging, the local cache makes sure that it does not delete any
file with an in-progress asynchronous upload.
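The purge constraint can be sketched as an eviction loop that skips in-progress files; `PurgeSketch` is an illustrative model, not the actual cache implementation:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the purge loop: evict oldest entries until the cache fits,
// but never a file whose asynchronous upload is still running.
class PurgeSketch {
    // Evicts from 'cache' (path -> size, oldest first) until the total
    // size is at most 'limit'; returns the total size after purging.
    static long purge(LinkedHashMap<String, Long> cache,
                      Set<String> inProgress, long limit) {
        long total = cache.values().stream().mapToLong(Long::longValue).sum();
        Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
        while (total > limit && it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (inProgress.contains(e.getKey())) continue; // still uploading: skip
            total -= e.getValue();
            it.remove();
        }
        return total;
    }
}
```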
h3. DataStore initialization behavior w.r.t. AsyncUploadCache
It is possible that asynchronous uploads are still in progress when the server
shuts down. When an asynchronous upload is added to AsyncUploadCache, the
cache is immediately persisted to a file on the filesystem. During
S3DataStore's initialization, it checks for any incomplete asynchronous
uploads and uploads them concurrently in multiple threads. It throws a
RepositoryException if the file for such an upload is not found in the local
cache; as far as the code is concerned, this is only possible when somebody
has removed files from the local cache manually. To proceed despite such
inconsistencies, set the parameter contOnAsyncUploadFailure to true in
repository.xml; this ignores all missing files and resets AsyncUploadCache.
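The startup scan can be sketched as below; `InitRecoverySketch` is an illustrative model (an unchecked exception stands in for RepositoryException, and the re-upload submission is stubbed):

```java
import java.io.File;
import java.util.Map;

// Sketch of the initialization scan: re-upload every pending entry, or
// fail (unless contOnAsyncUploadFailure is set) when its file is gone.
class InitRecoverySketch {
    static void recover(Map<String, Long> pendingUploads, File cacheDir,
                        boolean contOnAsyncUploadFailure) {
        for (Map.Entry<String, Long> e : pendingUploads.entrySet()) {
            File f = new File(cacheDir, e.getKey());
            if (!f.exists()) {
                if (contOnAsyncUploadFailure) continue;  // skip missing file
                // The real code throws RepositoryException here.
                throw new IllegalStateException("Missing local cache file: " + f);
            }
            // ... submit f to the upload thread pool here ...
        }
    }
}
```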
> Asynchronous upload file to S3
> ------------------------------
>
> Key: JCR-3733
> URL: https://issues.apache.org/jira/browse/JCR-3733
> Project: Jackrabbit Content Repository
> Issue Type: Sub-task
> Components: jackrabbit-core
> Reporter: Shashank Gupta
> Fix For: 2.7.5
>
>
> As of now, a customer has reported that the write performance of an
> EBS-based DataStore is 3x better than that of the S3 DataStore.
> With this feature, the objective is to achieve comparable write performance
> for the S3 DataStore.
--
This message was sent by Atlassian JIRA
(v6.2#6252)