Hi,
Clint's question, assuming my response was correct, raises a second
question.

Should Sling support binary upload streaming without using intermediate
disk buffers?

i.e.

Client   -> Sling -> target persistence
response <- Sling

not

Client   -> Sling -> local disk buffer
                     local disk buffer -> target persistence
response <- Sling

and not, in the case of the Oak S3 DS:

Client   -> Sling -> local disk buffer
                     local disk buffer -> Oak S3 async disk buffer
response <- Sling
                     Oak S3 async disk buffer -> S3


I don't know if streaming is possible in Sling via the SlingMainServlet,
given the way in which the request is wrapped, but Commons FileUpload does
have a streaming API, so the request input stream, or an individual
multipart part, could be streamed directly to a
Resource.adaptTo(InputStream.class), provided Sling would allow it.

Streaming does require that sufficient information to perform the final
storage precedes the stream in the request (auth headers, resource
identification, target resource name, etc.).
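
For illustration, a minimal sketch of what that might look like with the
Commons FileUpload streaming API; no FileItemFactory is configured, so
nothing is buffered to local disk. The field ordering matters for the
reason above, and TargetPersistence and the "path" field are hypothetical
stand-ins, not Sling API:

import java.io.IOException;
import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;

public class StreamingUploadHelper {

    // Hypothetical target; stands in for whatever performs final storage.
    interface TargetPersistence {
        void write(String path, InputStream in) throws IOException;
    }

    void streamUpload(HttpServletRequest request, TargetPersistence persistence)
            throws IOException, FileUploadException {
        // No FileItemFactory => pure streaming mode, no local disk buffer.
        ServletFileUpload upload = new ServletFileUpload();
        FileItemIterator iter = upload.getItemIterator(request);
        String targetPath = null;
        while (iter.hasNext()) {
            FileItemStream item = iter.next();
            try (InputStream in = item.openStream()) {
                if (item.isFormField()) {
                    // The field naming the target must precede the binary part.
                    if ("path".equals(item.getFieldName())) {
                        targetPath = Streams.asString(in);
                    }
                } else {
                    persistence.write(targetPath, in);
                }
            }
        }
    }
}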

IIRC, the alternative for users at present is to write a custom servlet and
mount it as an OSGi servlet.
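
A rough sketch of such a servlet, registered on the OSGi HTTP whiteboard so
the raw request body is read directly from the container without Sling's
request wrapping; the class name and pattern are hypothetical:

import java.io.IOException;
import java.io.InputStream;
import javax.servlet.Servlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.osgi.service.component.annotations.Component;

// A plain whiteboard servlet, bypassing the SlingMainServlet entirely.
@Component(service = Servlet.class,
        property = "osgi.http.whiteboard.servlet.pattern=/bin/streamupload")
public class StreamingUploadServlet extends HttpServlet {

    @Override
    protected void doPut(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        try (InputStream in = req.getInputStream()) {
            // stream 'in' directly to the target persistence here
        }
        resp.setStatus(HttpServletResponse.SC_CREATED);
    }
}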

Best Regards
Ian


On 29 July 2016 at 18:56, Ian Boston <[email protected]> wrote:

> Hi,
>
> BTW: IIRC there is a 32-bit problem (2GB files) with HTTP via some proxies
> that can be avoided by using chunked transfer encoding, since each chunk
> doesn't need to be large; hence a PUT with chunked encoding will stream.
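>
> A client-side sketch of such a chunked PUT (the URL and file name are
> placeholders):
>
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Paths;
>
> public class ChunkedPutClient {
>     public static void main(String[] args) throws Exception {
>         // Chunked streaming mode: no Content-Length header is sent, so no
>         // proxy ever sees a single >2GB length; the body streams in chunks.
>         HttpURLConnection conn = (HttpURLConnection)
>             new URL("http://localhost:8080/content/large.bin").openConnection();
>         conn.setRequestMethod("PUT");
>         conn.setDoOutput(true);
>         conn.setChunkedStreamingMode(8192); // chunk size in bytes
>         try (OutputStream out = conn.getOutputStream();
>              InputStream in = Files.newInputStream(Paths.get("large.bin"))) {
>             byte[] buf = new byte[8192];
>             for (int n; (n = in.read(buf)) != -1; ) {
>                 out.write(buf, 0, n);
>             }
>         }
>         System.out.println(conn.getResponseCode()); // request completes here
>     }
> }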
>
> More inline.
>
>
>
>
> On 29 July 2016 at 18:16, Clint Goudie-Nice <[email protected]> wrote:
>
>> Hello all,
>>
>> Do binary uploads (assets, etc.) get written to a temp location before
>> being put into the repository, or are they streamed end-to-end for these 4
>> transfer types:
>>
>>
>> 1)      MIME multipart uploads / form uploads
>>
>
> Multipart uploads > 256000 bytes are written to disk, using Commons
> FileUpload's ServletFileUpload [1] in [2], which produces FileItems that
> are then read. I think the InputStream from the FileItem is connected to
> the OutputStream of a jcr:data property and the data is pumped between the
> two in blocks.
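>
> A rough sketch of that pumping step via the JCR 2.0 API; the node names,
> mime type and method are illustrative, not the actual Sling code:
>
> import javax.jcr.Binary;
> import javax.jcr.Node;
> import javax.jcr.Session;
> import org.apache.commons.fileupload.FileItem;
>
> // Illustrative only: hand the buffered FileItem's InputStream to JCR,
> // which pumps it into the repository's DataStore in blocks.
> void writeToJcr(Session session, Node parent, FileItem item) throws Exception {
>     Node file = parent.addNode("upload.bin", "nt:file");
>     Node content = file.addNode("jcr:content", "nt:resource");
>     Binary data = session.getValueFactory().createBinary(item.getInputStream());
>     content.setProperty("jcr:data", data);
>     content.setProperty("jcr:mimeType", "application/octet-stream");
>     session.save();
> }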
>
> I can't find any evidence of Sling using the FileUpload streaming API for
> multipart POSTs [3].
>
>
>>
>> 2)      Transfer-Encoding: chunked uploads
>>
>
>
> This is a lower-level transfer encoding handled by Jetty; chunked encoding
> does not surface in the Servlet API (IIRC). When streaming, it allows both
> request and response bodies to flow without the content length being known
> in advance, so Jetty uses it, producing one chunk on every flush. I would
> expect a modern browser to use chunked encoding for uploads.
>
>
>> 3)      Plain binary uploads with a specified length header
>>
>
> PUT operations are handled by the Jackrabbit WebDAV bundle. I am not
> familiar with the code, but I do remember sending large volumes of data
> through it in 2007 and not seeing heap or local file IO; [4] backs that up,
> I think.
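>
> For completeness, a hypothetical client-side sketch of such a fixed-length
> PUT (the URL is a placeholder):
>
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Paths;
>
> public class FixedLengthPutClient {
>     public static void main(String[] args) throws Exception {
>         // Fixed-length streaming mode: a Content-Length header is sent and
>         // the body streams through without being buffered on the client.
>         long size = Files.size(Paths.get("large.bin"));
>         HttpURLConnection conn = (HttpURLConnection)
>             new URL("http://localhost:8080/dav/default/large.bin").openConnection();
>         conn.setRequestMethod("PUT");
>         conn.setDoOutput(true);
>         conn.setFixedLengthStreamingMode(size);
>         try (OutputStream out = conn.getOutputStream();
>              InputStream in = Files.newInputStream(Paths.get("large.bin"))) {
>             in.transferTo(out); // Java 9+
>         }
>         System.out.println(conn.getResponseCode());
>     }
> }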
>
>
>>
>> 4)      Plain binary uploads with no specified length header
>>
>
> If the content length is missing and it's not chunked encoding, Jetty will
> read until the socket is closed. From a server point of view there is no
> difference in how the request is handled.
>
>
>
>
>>
>> There are pros and cons to each approach. Obviously, if you stream it end
>> to end and the client is uploading a large stream of data, you have to
>> maintain a session over a long period, possibly hours.
>>
>
> I assume you mean a JCR session, not an HTTP session.
> The request will be authenticated before streaming starts, so the session
> will be validated at the start of the request and closed when the session
> is logged out, i.e. at the end of the request (IIRC).
>
>
>>
>> If it is being streamed to a temporary location first, and then to the
>> repository, you require an additional IO write and read, but potentially
>> less session time.
>>
>
> The session time is the same regardless, but buffering to disk requires
> more IO, so the operation takes longer, and there is no interleaving
> between the incoming request and the stream to the underlying DS. If the
> networks are the same speed, the upload takes 2x the time. Since the
> session is created before the upload starts, and before Commons FileUpload
> processes the request, the session is open for the entire request.
>
> There is no load on the underlying repository from a file upload, other
> than the metadata, which is minimal. I mean that in the sense that there
> won't be 1000s of Oak Documents created during the upload, only a pointer
> to the DataStore and a handful of nodes. Since that's a small commit it
> won't generate a branch.
>
> Obviously, if you are using a MongoDB DS it will generate lots of blobs,
> which will impact replication and other things.
> An S3 DS will not start sending the data until a second copy of the data
> has been made in the S3 async upload cache (assuming that's enabled);
> otherwise I think it will stream directly to the S3 API.
> A FS DS is, well, FS.
>
>
>>
>> I would like to better understand the requirements on the system imposed
>> by these different upload types.
>>
>> Clint
>>
>
> HTH
> Best Regards
> Ian
>
> 1 https://commons.apache.org/proper/commons-fileupload/using.html
>
> 2 org.apache.sling.engine.impl.parameters.ParameterSupport#parseMultiPartPost
> 3 https://commons.apache.org/proper/commons-fileupload/streaming.html
> 4 https://github.com/apache/jackrabbit/blob/b252e505e34b03638207e959aaafce7c480ebaaa/jackrabbit-webdav/src/main/java/org/apache/jackrabbit/webdav/server/AbstractWebdavServlet.java#L629
>
