Hi Chetan,

Thanks for putting together such a detailed document. It's a collection of
very interesting use cases, but I'm not sure if Oak is the right place to
search for a solution to those problems. Let me explain in more detail.

Every use case you outlined requires Oak to expose the location of the
binary objects in the underlying storage. As soon as a file path, a file
descriptor or an S3 object ID traverses the boundary between Oak and its
clients, all bets are off. Oak automatically loses ownership over that
piece of data. Can a leaked binary object be garbage collected? Can it be
moved around? Is it safe to access the binary object concurrently? Does Oak
own a cached representation of that binary object that might be invalidated
by the client?

These are, in my opinion, more specific instances of the same question:
does the correctness of Oak depend on the behaviour of the user?

Regarding UC1 and UC2, I suppose that the client has some special binary
objects that need to be treated in a special way. In UC1 the special binary
objects are uploaded images that participate in a more complex workflow of
conversion and rendition generation. In UC2 the special binary objects are
some files that the client's organization wants to make accessible to a
geographically distributed team leveraging S3's infrastructure. In my
opinion, the solution to UC1 and UC2 is for the clients to recognize that
those "special binary objects" are so special that they deserve special
treatment on the client's side. Oak can be used to store references to
those binary objects, but not the binary objects themselves. Similar
considerations can be applied to UC3, UC5, UC6 and UC7 too.

In UC4 you cite the zero copy support of Jetty and the design of Kafka as
good examples of efficiency. The example holds only up to a point, though.
Jetty and Kafka manage both endpoints of the stream: they own both the
files to be streamed and the sockets those files are streamed through. Oak,
instead, is bounded on top by the JCR specification, which acts as an
intermediary between Oak and its users. A
solution for UC4 that can be implemented today would involve Sling serving
static files directly using a zero copy approach. The path of those files
can be saved in Oak, of course. Similar considerations can be applied to
UC8 too.
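
To make the zero copy point a bit more concrete, here is a minimal sketch
using plain Java NIO. It is not Oak, Sling or Jetty code; the class name
ZeroCopySketch and the helper spool() are made up for illustration. It only
shows that the code doing the transfer needs to own both the file and the
target channel, which is exactly what a layer sitting above Oak could do
with a path stored in the repository.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {

    // Hypothetical helper: spools the file at 'path' into 'target' and
    // returns the number of bytes transferred. Assumes a blocking target
    // channel and a file that is not modified during the transfer.
    public static long spool(Path path, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("binary", ".bin");
        try {
            Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
            // An in-memory sink keeps the sketch self-contained. The JVM can
            // only avoid user-space copies when the target is something like
            // a SocketChannel; with this sink it falls back to a normal copy.
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long sent = spool(file, Channels.newChannel(sink));
            System.out.println(sent + " bytes spooled");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

In a real deployment the target would be the socket channel of the HTTP
connection, and the path would be whatever the client chose to store in
Oak.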

While I would like to see these problems solved, I still don't believe that
Oak is the right place to implement these solutions.

2016-06-01 9:30 GMT+02:00 Chetan Mehrotra <[email protected]>:

> Hi Team,
>
> Recently we had a discussion around a new API proposal for binary access
> [1]. From the discussion it was determined that we should first have a
> collection of the kind of usecases which cannot be easily met by current
> JCR Binary support in Oak so as to get better understanding of various
> requirements. That would help us in coming up with a proper solution to
> enable such usecases going forward
>
> To move forward on that I have tried to collect the various usecases at [2]
> which I have seen in the past.
>
> UC1 - processing a binary in JCR with a native library that only has access
>           to the file system
> UC2 - Efficient replication across regions in S3
> UC3 - Text Extraction without temporary File with Tika
> UC4 - Spooling the binary content to socket output via NIO
> UC5 - Transferring the file to FileDataStore with minimal overhead
> UC6 - S3 import
> UC7 - Random write access in binaries
> UC8 - X-SendFile
>
>
> I would like to get teams feedback on the various usecases and then come up
> with the list of usecases which we would like to properly support in Oak.
>
> Once that is determined we can discuss the possible solutions and decide on
> how it gets finally implemented.
>
> Kindly provide your feedback!
>
> Chetan Mehrotra
> [1] http://markmail.org/thread/6mq4je75p64c5nyn
> [2] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
>
