[
https://issues.apache.org/jira/browse/JCR-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906853#comment-13906853
]
Shashank Gupta edited comment on JCR-3733 at 3/12/14 4:45 AM:
--------------------------------------------------------------
h2. Specification
h3. S3DataStore Asynchronous Upload to S3
The current logic for adding a file record to the S3DataStore is to add the
file to the local cache and then upload it to S3 in a single synchronous step.
This feature splits that logic into a synchronous add to the local cache
followed by an asynchronous upload of the file to S3. Until the asynchronous
upload completes, all data (input stream, length and lastModified) for that
file record is fetched from the local cache.
The AWS SDK provides [upload progress
listeners|http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/event/ProgressListener.html]
which provide callbacks on the status of an in-progress upload.
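The completion-callback pattern the SDK's progress listeners enable can be sketched with plain JDK classes as follows; `UploadListener` and `AsyncUploader` are illustrative stand-ins, not the actual SDK or S3DataStore types:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the SDK's ProgressListener callbacks.
interface UploadListener {
    void onComplete(String path);
    void onFailed(String path, Exception e);
}

class AsyncUploader {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Start an upload in the background and report the outcome through
    // the listener, the way the SDK's progress events would.
    void upload(String path, UploadListener listener) {
        pool.submit(() -> {
            try {
                // ... stream the file to S3 here ...
                listener.onComplete(path);
            } catch (Exception e) {
                listener.onFailed(path, e);
            }
        });
    }

    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In the real feature, the `onComplete` path is where the AsyncUploadCache entry is flushed.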
h3. Flag to turn it off
The parameter 'asyncUploadLimit' limits the number of concurrent asynchronous
uploads to S3. Once this limit is reached, the next upload to S3 is synchronous
until one of the asynchronous uploads completes. To disable this feature, set
the asyncUploadLimit parameter to 0 in repository.xml. The default is 100.
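The limit check can be sketched as a small gate; `UploadGate` and its fields are illustrative names, not actual S3DataStore members:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the asyncUploadLimit gate.
class UploadGate {
    private final int asyncUploadLimit;                    // 0 disables async uploads
    private final AtomicInteger inProgress = new AtomicInteger();

    UploadGate(int asyncUploadLimit) {
        this.asyncUploadLimit = asyncUploadLimit;
    }

    // Returns true if the caller may upload asynchronously; the slot is
    // released again from the upload-completed callback.
    boolean tryAcquireAsyncSlot() {
        if (asyncUploadLimit <= 0) return false;           // feature disabled
        while (true) {
            int n = inProgress.get();
            if (n >= asyncUploadLimit) return false;       // limit reached: go synchronous
            if (inProgress.compareAndSet(n, n + 1)) return true;
        }
    }

    void release() {
        inProgress.decrementAndGet();
    }
}
```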
h3. Caution
# This feature should not be used in a clustered active-active Jackrabbit
deployment, as a file may not be fully uploaded to S3 before it is accessed on
another node. In active-passive clustered mode, this feature requires manually
uploading incomplete asynchronous uploads to S3 after failover.
# When using this feature, it is strongly recommended NOT to delete any file
from the local cache manually, as the local cache may contain files whose
uploads to S3 have not completed.
h3. Asynchronous Upload Cache
S3DataStore keeps an AsyncUploadCache which holds in-progress asynchronous
uploads. This class contains two data structures: a \{@link Map<String,
Long>\} of file path to lastModified holding the in-progress asynchronous
uploads, and a \{@link Set<String>\} of in-progress uploads that were marked
for delete while the asynchronous upload was running. When an asynchronous
upload is initiated, an entry is added to this cache, and when the upload
completes, the corresponding entry is flushed. Any modification to this cache
is immediately serialized to the filesystem in a synchronized code block.
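The two data structures and their synchronized updates can be sketched as below; `AsyncUploadCacheSketch` is an illustrative model, not the actual class, and the real implementation serializes its state to a file where `persist()` is stubbed here:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the two data structures described above.
class AsyncUploadCacheSketch {
    private final Map<String, Long> inProgress = new HashMap<>(); // path -> lastModified
    private final Set<String> toBeDeleted = new HashSet<>();      // marked while upload runs

    synchronized void add(String path, long lastModified) {
        inProgress.put(path, lastModified);
        persist();
    }

    // Called from the upload-completed callback; returns true if the
    // entry was marked for delete while the upload was running.
    synchronized boolean remove(String path) {
        inProgress.remove(path);
        boolean deleteIt = toBeDeleted.remove(path);
        persist();
        return deleteIt;
    }

    synchronized void markForDelete(String path) {
        if (inProgress.containsKey(path)) {
            toBeDeleted.add(path);
            persist();
        }
    }

    synchronized boolean isInProgress(String path) {
        return inProgress.containsKey(path);
    }

    private void persist() {
        // The real implementation serializes the cache state to the
        // filesystem here, inside the synchronized block.
    }
}
```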
h3. Semantics of various DataStore and DataRecord APIs w.r.t. AsyncUploadCache
Before this feature, S3 was the single source of truth. For example,
DataStore#getRecordIfStored(DataIdentifier) returns a DataRecord if the
dataIdentifier exists in S3 and null otherwise; whether the dataIdentifier
exists in the local cache does not matter. With this feature, S3 remains the
source of truth for completed uploads, while AsyncUploadCache is the source of
truth for in-progress asynchronous uploads.
h4. DataRecord DataStore#addRecord(InputStream)
Checks whether an asynchronous upload can be started for the input stream,
based on asyncUploadLimit and the current local cache size. If the local cache
advises to proceed with an asynchronous upload, this method adds an entry to
AsyncUploadCache and starts the asynchronous upload; otherwise it proceeds
with a synchronous upload to S3. If an asynchronous upload is already in
progress for that dataIdentifier, it just updates lastModified in
AsyncUploadCache. Once the asynchronous upload completes, the callback removes
the entry from AsyncUploadCache.
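The decision flow above can be sketched with S3 and the caches modelled as in-memory maps; `AddRecordSketch` and its members are illustrative, not the real S3DataStore API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative control flow of addRecord: asynchronous when a slot is
// free, synchronous otherwise; maps stand in for the cache and bucket.
class AddRecordSketch {
    final Map<String, Long> asyncCache = new HashMap<>(); // in-progress uploads
    final Map<String, byte[]> s3 = new HashMap<>();       // stand-in for the bucket
    final int asyncUploadLimit;
    final ExecutorService pool = Executors.newCachedThreadPool();

    AddRecordSketch(int asyncUploadLimit) {
        this.asyncUploadLimit = asyncUploadLimit;
    }

    synchronized void addRecord(String id, byte[] data, long now) {
        if (asyncCache.containsKey(id)) {        // upload already running:
            asyncCache.put(id, now);             // just touch lastModified
            return;
        }
        if (asyncUploadLimit > 0 && asyncCache.size() < asyncUploadLimit) {
            asyncCache.put(id, now);             // asynchronous path
            pool.submit(() -> {
                putToS3(id, data);
                synchronized (this) {            // upload-completed callback
                    asyncCache.remove(id);
                }
            });
        } else {
            putToS3(id, data);                   // synchronous path
        }
    }

    synchronized void putToS3(String id, byte[] data) {
        s3.put(id, data);
    }

    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```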
h4. DataRecord DataStore#getRecordIfStored(DataIdentifier)
Returns a DataRecord if an in-progress asynchronous upload exists in
AsyncUploadCache or a record exists in S3 for the dataIdentifier. If
minModified > 0, the timestamp is updated in AsyncUploadCache and S3.
h4. MultiDataStoreAware#deleteRecord(DataIdentifier)
For in-progress uploads, this method adds the identifier to the "toBeDeleted"
set in AsyncUploadCache. When the asynchronous upload completes and invokes
the callback, the callback checks whether the in-progress upload was marked
for delete and, if so, invokes deleteRecord to actually delete the record.
h4. DataStore#deleteAllOlderThan(long min)
Deletes all records older than min from S3. As AsyncUploadCache maintains a
map of in-progress asynchronous uploads to lastModified, it marks for delete
those asynchronous uploads whose lastModified < min. When an asynchronous
upload completes and invokes the callback, the callback checks whether the
in-progress upload was marked for delete and, if so, invokes deleteRecord to
actually delete the record.
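The deferred-delete sweep over the in-progress map can be sketched as follows; `DeleteOlderThanSketch` is an illustrative model, and the S3 side of the delete is stubbed:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the deleteAllOlderThan sweep: in-progress entries with
// lastModified < min are only marked; the actual delete happens later,
// in the upload-completed callback.
class DeleteOlderThanSketch {
    final Map<String, Long> inProgress = new HashMap<>(); // id -> lastModified
    final Set<String> toBeDeleted = new HashSet<>();

    // Returns identifiers of in-progress uploads newly marked for delete.
    synchronized Set<String> deleteAllOlderThan(long min) {
        Set<String> marked = new HashSet<>();
        for (Map.Entry<String, Long> e : inProgress.entrySet()) {
            if (e.getValue() < min) {
                toBeDeleted.add(e.getKey());
                marked.add(e.getKey());
            }
        }
        // ... delete completed records older than min from S3 here ...
        return marked;
    }
}
```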
h4. Iterator<DataIdentifier> DataStore#getAllIdentifiers()
Returns all identifiers in S3, plus the in-progress upload identifiers from
AsyncUploadCache, minus the identifiers in AsyncUploadCache's "toBeDeleted"
set.
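The union-minus view can be sketched as a set operation; `AllIdentifiersSketch` and the parameter names are illustrative:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// The identifier view described above: S3 keys, plus in-progress
// asynchronous uploads, minus entries already marked for delete.
class AllIdentifiersSketch {
    static Iterator<String> getAllIdentifiers(Set<String> s3Keys,
                                              Set<String> inProgress,
                                              Set<String> toBeDeleted) {
        Set<String> all = new HashSet<>(s3Keys);
        all.addAll(inProgress);      // plus in-progress uploads
        all.removeAll(toBeDeleted);  // minus pending deletes
        return all.iterator();
    }
}
```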
h4. long DataRecord#getLength()
If the file exists in the local cache, the length is retrieved from it.
Otherwise the length is retrieved from S3 and cached in the local cache.
h4. DataRecord#getLastModified()
If the record is an in-progress upload, lastModified is retrieved from
AsyncUploadCache; otherwise it is retrieved from S3 and cached in the local
cache.
h3. Behavior of Local cache Purge
The local cache has a size limit. When the current size of the cache exceeds
the limit, the cache enters auto-purge mode to clean older entries and reclaim
space. During purging, the local cache makes sure that it does not delete any
file with an in-progress asynchronous upload.
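The purge constraint can be sketched as an eviction loop that skips in-progress files; `PurgeSketch` is an illustrative model, not the actual cache implementation:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the purge loop: evict oldest entries until the cache fits,
// but never a file whose asynchronous upload is still running.
class PurgeSketch {
    // Evicts from 'cache' (path -> size, oldest first) until the total
    // size is at most 'limit'; returns the total size after purging.
    static long purge(LinkedHashMap<String, Long> cache,
                      Set<String> inProgress, long limit) {
        long total = cache.values().stream().mapToLong(Long::longValue).sum();
        Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
        while (total > limit && it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (inProgress.contains(e.getKey())) continue; // still uploading: skip
            total -= e.getValue();
            it.remove();
        }
        return total;
    }
}
```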
h3. DataStore initialization behavior w.r.t. AsyncUploadCache
It is possible that asynchronous uploads are still in progress when the server
shuts down. When an asynchronous upload is added to AsyncUploadCache, the
cache is immediately persisted to a file on the filesystem. During
S3DataStore's initialization, it checks for any incomplete asynchronous
uploads and uploads them concurrently in multiple threads. It throws a
RepositoryException if the file for such an upload is not found in the local
cache; as far as the code is concerned, this is only possible when somebody
has removed files from the local cache manually. To proceed despite such
inconsistencies, set the parameter contOnAsyncUploadFailure to true in
repository.xml; this ignores all missing files and resets AsyncUploadCache.
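The startup scan can be sketched as below; `InitRecoverySketch` is an illustrative model (an unchecked exception stands in for RepositoryException, and the re-upload submission is stubbed):

```java
import java.io.File;
import java.util.Map;

// Sketch of the initialization scan: re-upload every pending entry, or
// fail (unless contOnAsyncUploadFailure is set) when its file is gone.
class InitRecoverySketch {
    static void recover(Map<String, Long> pendingUploads, File cacheDir,
                        boolean contOnAsyncUploadFailure) {
        for (Map.Entry<String, Long> e : pendingUploads.entrySet()) {
            File f = new File(cacheDir, e.getKey());
            if (!f.exists()) {
                if (contOnAsyncUploadFailure) continue;  // skip missing file
                // The real code throws RepositoryException here.
                throw new IllegalStateException("Missing local cache file: " + f);
            }
            // ... submit f to the upload thread pool here ...
        }
    }
}
```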
> Asynchronous upload file to S3
> ------------------------------
>
> Key: JCR-3733
> URL: https://issues.apache.org/jira/browse/JCR-3733
> Project: Jackrabbit Content Repository
> Issue Type: Sub-task
> Components: jackrabbit-core
> Reporter: Shashank Gupta
> Fix For: 2.7.5
>
>
> As of now, a customer has reported that the write performance of an
> EBS-based DataStore is 3x better than that of the S3 DataStore.
> With this feature, the objective is to achieve comparable write performance
> for the S3 DataStore.
--
This message was sent by Atlassian JIRA
(v6.2#6252)