I agree, it would be valuable to stream it.

AFAIK, the problem is that during a form submission you don’t know the order of 
the form attributes, and the binary data can be mixed into the middle. Maybe 
these could be ordered so the file submission came last, making it possible to 
stream the file data without a temporary location.

Clint

On 8/1/16, 3:57 AM, "[email protected] on behalf of Ian Boston" 
<[email protected]> wrote:

    Hi,
    Clint's question, assuming my response was correct, raises a second
    question.
    
    Should Sling support binary upload streaming without using intermediate
    disk buffers?
    
    i.e.
    request -> Sling -> target persistence
    response <- Sling
    
    Not
    Client -> Sling -> Local Disk Buffer.
                                Local Disk Buffer -> target persistence
    response <- Sling
    
    and not, in the case of the Oak S3 DS:
    Client -> Sling -> Disk Buffer.
                                Disk Buffer -> Oak S3 Async Disk Buffer
    response <- Sling
                                Oak S3 Async Disk Buffer -> S3
    
    
    I don't know if streaming is possible in Sling via the SlingMainServlet
    given the way in which the request is wrapped, but Commons FileUpload does
    have a streaming API, so the request input stream, or a multipart part, can
    be streamed directly to a Resource.adaptTo(InputStream.class), provided
    Sling would allow it.
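    For reference, the Commons FileUpload streaming API referred to in [3] has
    roughly this shape (a fragment only, not compilable standalone: it needs the
    Servlet API and commons-fileupload on the classpath, and "storeTo" is a
    hypothetical stand-in for whatever Sling would adapt the target Resource to):

    ```java
    // Fragment: iterate the multipart parts as they arrive on the wire,
    // without buffering them to disk.
    ServletFileUpload upload = new ServletFileUpload();
    FileItemIterator iter = upload.getItemIterator(request);
    while (iter.hasNext()) {
        FileItemStream item = iter.next();
        InputStream stream = item.openStream();
        if (item.isFormField()) {
            // small form field: safe to read into memory
            String value = Streams.asString(stream);
        } else {
            // file part: pump straight to the target persistence
            storeTo(resource, stream); // hypothetical
        }
    }
    ```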
    
    Streaming does require that sufficient information to perform the final
    storage precedes the stream in the request (auth headers, resource
    identification, target resource name, etc.).
    
    IIRC, the alternative for users at present is to write a custom servlet and
    mount it as an OSGi servlet.
    
    Best Regards
    Ian
    
    
    On 29 July 2016 at 18:56, Ian Boston <[email protected]> wrote:
    
    > Hi,
    >
    > BTW: There is IIRC a 32-bit problem (2 GB files) with HTTP via some proxies
    > that can be avoided by using chunked transfer encoding, as each chunk
    > doesn't need to be large; hence a PUT with chunked encoding will stream.
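    > (For reference, each chunk is framed independently: the chunk size in hex,
    > CRLF, the data, CRLF, with a zero-length chunk terminating the body, so
    > neither side needs the total length up front. A minimal framing sketch,
    > nothing Sling- or Jetty-specific:)

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Minimal sketch of HTTP/1.1 chunked transfer framing: each chunk is
// "<size in hex>\r\n<data>\r\n", and a final "0\r\n\r\n" terminates the body.
public class ChunkedFraming {

    static byte[] encodeChunk(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(Integer.toHexString(data.length)
            .getBytes(StandardCharsets.US_ASCII));
        out.writeBytes("\r\n".getBytes(StandardCharsets.US_ASCII));
        out.writeBytes(data);
        out.writeBytes("\r\n".getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    static byte[] lastChunk() {
        return "0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
    }
}
```

    > (On the client side, java.net.HttpURLConnection#setChunkedStreamingMode
    > turns this on for a PUT/POST without buffering the body locally.)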
    >
    > More inline.
    >
    >
    >
    >
    > On 29 July 2016 at 18:16, Clint Goudie-Nice <[email protected]> wrote:
    >
    >> Hello all,
    >>
    >> Do binary uploads (assets, etc.) get written to a temp location before
    >> being put into the repository, or are they streamed end-to-end for these 4
    >> transfer types:
    >>
    >>
    >> 1)      Mime Multipart uploads / form uploads
    >>
    >
    > Multipart uploads > 256000 bytes are written to disk, using the Commons
    > FileUpload ServletFileUpload [1] in [2], which produces FileItems that are
    > then read. I think the InputStream from the FileItem is connected to the
    > OutputStream of a jcr:data property and the data pumped between the two in
    > blocks.
    >
    > I can't find any evidence of Sling using the FileUpload streaming API for
    > multipart posts [3].
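    > (The pumping described above is just a fixed-block copy; a stdlib sketch,
    > with the names being mine, not Sling's:)

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of pumping an upload InputStream into the OutputStream behind a
// binary property, in fixed-size blocks (hypothetical helper, not the
// actual Sling code).
public class StreamPump {

    static long pump(InputStream in, OutputStream out) throws IOException {
        byte[] block = new byte[256 * 1024];
        long total = 0;
        for (int n; (n = in.read(block)) != -1; ) {
            out.write(block, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```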
    >
    >
    >>
    >> 2)      Content-Transfer:Chunked uploads
    >>
    >
    >
    > This is a lower-level transfer encoding handled by Jetty; chunked encoding
    > does not surface in the Servlet API (IIRC). When streaming, it allows
    > response output and upload input to stream without knowing the content
    > length, so Jetty uses it, producing one chunk on every flush. I would
    > expect a modern browser to use chunked encoding for uploads.
    >
    >
    >> 3)      Plain binary uploads with a specified length header
    >>
    >
    > PUT operations are handled by the Jackrabbit WebDAV bundle. I am not
    > familiar with the code, but I do remember sending large volumes of data
    > through it in 2007 and not seeing heap or local file IO. [4] backs that up,
    > I think.
    >
    >
    >>
    >> 4)      Plain binary uploads with no specified length header
    >>
    >
    > If the content length is missing and it's not chunked encoding, Jetty will
    > read until the socket is closed. There is no difference from a server point
    > of view in how the request is handled.
    >
    >
    >
    >
    >>
    >> There are pros and cons to each approach. Obviously, if you stream it end
    >> to end, if the client is uploading a large stream of data, you have to
    >> maintain a session over a long period, possibly hours.
    >>
    >
    > I assume you mean JCR session, not HTTP session.
    > The request will be authenticated before streaming starts, so the session
    > will be validated at the start of the request and closed when the session
    > is logged out, i.e. at the end of the request (IIRC).
    >
    >
    >>
    >> If it is being streamed to a temporary location first, and then to the
    >> repository, you require an additional write and an additional read of IO,
    >> but potentially less session time.
    >>
    >
    > The session time is the same regardless, but the buffered approach requires
    > more IO, so the operation will take longer, and there is no interleaving
    > between the request and the stream to the underlying DS. If the networks
    > are the same speed, the upload takes 2x the time. Since the session is
    > created before the upload starts and before Commons FileUpload processes
    > the request, the session is open for the entire request.
    >
    > There is no load on the underlying repository from a file upload, other
    > than the metadata, which is minimal. I mean in the sense that there won't
    > be 1000s of Oak Documents being created during the upload, only a pointer
    > to the DataStore and a handful of nodes. Since that's a small commit, it
    > won't generate a branch.
    >
    > Obviously, if you are using a MongoDB DS it will generate lots of blobs,
    > which will impact replication and other things.
    > An S3 DS will not start sending the data until a second copy of the data
    > is made into the S3 async upload cache (assuming that's enabled); otherwise
    > I think it will stream directly to the S3 API.
    > The FS DS is, well, FS.
    >
    >
    >>
    >> I would like to better understand the requirements on the system imposed
    >> by these different upload types.
    >>
    >> Clint
    >>
    >
    > HTH
    > Best Regards
    > Ian
    >
    > 1 https://commons.apache.org/proper/commons-fileupload/using.html
    >
    > 2 org.apache.sling.engine.impl.parameters.ParameterSupport#parseMultiPartPost
    > 3 https://commons.apache.org/proper/commons-fileupload/streaming.html
    > 4 https://github.com/apache/jackrabbit/blob/b252e505e34b03638207e959aaafce7c480ebaaa/jackrabbit-webdav/src/main/java/org/apache/jackrabbit/webdav/server/AbstractWebdavServlet.java#L629
    >
    
