[ https://issues.apache.org/jira/browse/JCR-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604996#comment-16604996 ]

ASF GitHub Bot commented on JCR-4369:
-------------------------------------

GitHub user woonsan opened a pull request:

    https://github.com/apache/jackrabbit/pull/61

    JCR-4369: Avoid S3 Incomplete Read Warning by elegant aborting

    The AWS S3 SDK recommends _aborting_ the `S3ObjectInputStream` when the 
caller does not intend to consume the data: on `close()`, the underlying 
HttpClient drains the remaining bytes so the connection can be returned to the 
pool, which can become a significant performance problem for large objects. 
From the AWS S3 SDK's perspective, it is better to simply abort the underlying 
`HttpRequestBase` and evict the connection from the pool.
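    
    For illustration, here is a minimal sketch of the recommended pattern with 
the AWS SDK for Java 1.x (the bucket and key names are hypothetical):
    
        import com.amazonaws.services.s3.AmazonS3;
        import com.amazonaws.services.s3.AmazonS3ClientBuilder;
        import com.amazonaws.services.s3.model.S3Object;
        import com.amazonaws.services.s3.model.S3ObjectInputStream;
        
        public class S3AbortSketch {
            public static void main(String[] args) throws Exception {
                AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
                S3Object object = s3.getObject("my-bucket", "my-key"); // hypothetical
                S3ObjectInputStream in = object.getObjectContent();
                try {
                    // If the data will not be consumed, abort() tears down the
                    // underlying HttpRequestBase and evicts the connection from
                    // the pool instead of draining the remaining bytes on close.
                    in.abort();
                } finally {
                    in.close(); // no drain and no warning after abort()
                }
            }
        }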
    
    In a multi-threaded environment (due to concurrent requests and/or the 
`proactiveCaching` mode of `CachingDataStore`), the **reading** and **storing** 
actions in `o.a.j.c.data.CachingDataStore.getStream(DataIdentifier)` can end up 
in the **else** block of `o.a.j.core.data.LocalCache.store(String, 
InputStream)`, because a file with that name may already exist by the time the 
**else** block executes. In that case, the `S3ObjectInputStream` is never read 
or aborted. As a result, 
`com.amazonaws.services.s3.internal.S3AbortableInputStream#close()` ends up 
complaining about a stream that was neither aborted nor fully read.
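    
    To illustrate the problematic pattern (a simplified sketch, not the actual 
`LocalCache` code; the field and class names here are made up):
    
        import java.io.File;
        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.file.Files;
        
        class CacheStoreSketch {
            private final File cacheDir = new File("/tmp/cache"); // hypothetical
        
            void store(String fileName, InputStream in) throws IOException {
                File cached = new File(cacheDir, fileName);
                if (!cached.exists()) {
                    // the stream is fully consumed while copying: close() is fine
                    Files.copy(in, cached.toPath());
                } else {
                    // another thread has already cached the file; the stream is
                    // closed without being read or aborted, so
                    // S3AbortableInputStream#close() logs the warning
                    in.close();
                }
            }
        }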
    
    Therefore, my fix includes the following (a rough sketch follows after the 
list):
    - `LocalCache` checks whether the backend resource input stream is 
**abortable**. If so, it tries to _abort_ the backend resource stream. For this 
purpose, a `BackendResourceAbortable` interface is introduced in 
jackrabbit-data.
    - `S3Backend` wraps the `S3ObjectInputStream` to implement 
`BackendResourceAbortable`, leveraging commons-io's `ProxyInputStream`.
    - Some unit tests.
    - Just FYI, I also tested this locally against an S3-compatible system (ref: 
https://github.com/woonsanko/hippo-davstore-demo/tree/feature/vfs-file-system#option-4-using-the-aws-s3-datastore-instead-of-vfs-datastore)
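    
    As a rough sketch of the idea (simplified; the wrapper class name here is 
made up, see the PR diff for the real code):
    
        import org.apache.commons.io.input.ProxyInputStream;
        import com.amazonaws.services.s3.model.S3ObjectInputStream;
        
        // New interface in jackrabbit-data: a stream whose backend resource
        // can be aborted without reading the remaining data.
        interface BackendResourceAbortable {
            void abort();
        }
        
        // S3Backend-side wrapper: delegates all reads through commons-io's
        // ProxyInputStream and exposes abort() on the wrapped S3ObjectInputStream.
        class AbortableS3InputStream extends ProxyInputStream
                implements BackendResourceAbortable {
            private final S3ObjectInputStream s3Input;
        
            AbortableS3InputStream(final S3ObjectInputStream s3Input) {
                super(s3Input);
                this.s3Input = s3Input;
            }
        
            @Override
            public void abort() {
                // abort the underlying HTTP request instead of draining the body
                s3Input.abort();
            }
        }
    
    On the `LocalCache` side, the **else** block can then check `instanceof 
BackendResourceAbortable` and call `abort()` instead of merely closing the 
stream.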


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/woonsan/jackrabbit feature/JCR-4369

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jackrabbit/pull/61.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #61
    
----
commit 77f5a97c494eaae1c4b4beaa5a65b16153748df1
Author: Woonsan Ko <woonsan@...>
Date:   2018-09-04T22:32:47Z

    JCR-4369: let LocalCache be able to abort if stream is never read.

commit 98c26cae2a13c31317fbd2df08cc23df8c57b8c7
Author: Woonsan Ko <woonsan@...>
Date:   2018-09-05T19:43:58Z

    JCR-4369: unit test for whether LocalCache aborts or not.

----


> Avoid S3 Incomplete Read Warning
> --------------------------------
>
>                 Key: JCR-4369
>                 URL: https://issues.apache.org/jira/browse/JCR-4369
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-aws-ext
>    Affects Versions: 2.16.3, 2.17.5
>            Reporter: Woonsan Ko
>            Priority: Minor
>
> While using S3DataStore, the following logs are observed occasionally:
> {noformat}
> WARN [com.amazonaws.services.s3.internal.S3AbortableInputStream.close():178] 
> Not all bytes were read from the S3ObjectInputStream,
> aborting HTTP connection. This is likely an error and may result in 
> sub-optimal behavior. Request only the bytes you need via a ranged 
> GET or drain the input stream after use.
> {noformat}
> The warning logs are left not only by HTTP processing threads but also by 
> background threads, which made me suspect some 'issues' in the 
> {{S3DataStore}} implementation rather than just HTTP connections broken by 
> clients.
> That said, this is not a major issue: the AWS toolkit apparently just logs 
> the warning as a _recommendation_ in that case and still closes the 
> underlying HttpRequest object properly, so functionality is unaffected. It 
> is only about the 'warning' message and possibly sub-optimal HTTP request 
> handling under the hood (on the AWS toolkit side).
> After looking at the code, I noticed that 
> {{CachingDataStore#proactiveCaching}} is enabled by default, which means the 
> {{S3DataStore}} tries to _proactively_ download the binary content, 
> asynchronously in a new thread, even when only metadata is accessed through 
> {{#getLastModified(...)}} and {{#getLength(...)}}.
> Anyway, the _minor_ problem is that whenever the {{S3DataStore}} reads 
> content (in other words, gets an input stream on an {{S3Object}}), it is 
> expected to either _read_ all the data or _abort_ the input stream. Merely 
> _closing_ the input stream is not good enough from the AWS SDK's 
> perspective and results in the warning. See the 
> {{S3AbortableInputStream#close()}} method. \[1\]
> Therefore, some S3-related classes (such as 
> {{org.apache.jackrabbit.core.data.LocalCache#store(String, InputStream)}}, 
> {{CachingDataStore#getStream(DataIdentifier)}}, etc.) should be improved as 
> follows:
> - If the local cache file doesn't exist or the cache is in purge mode, keep 
> the current behavior: just copy everything to the local cache file and 
> close the stream.
> - Otherwise, {{abort}} the underlying {{S3ObjectInputStream}}.
> The issue is a known one in the AWS toolkit. \[2,3\] It seems that clients 
> using the toolkit need to _abort_ the input stream if they don't want to 
> read the data fully.
> \[1\] 
> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/internal/S3AbortableInputStream.java#L174-L187
> \[2\] https://github.com/aws/aws-sdk-java/issues/1211
> \[3\] https://github.com/aws/aws-sdk-java/issues/1657


