Hi Chetan,

Thanks for putting together such a detailed document. It's a collection of very interesting use cases, but I'm not sure that Oak is the right place to look for a solution to those problems. Let me explain in more detail.

Every use case you outlined requires Oak to expose the location of the binary objects in the underlying storage. As soon as a file path, a file descriptor or an S3 object ID crosses the boundary between Oak and its clients, all bets are off. Oak automatically loses ownership over that piece of data. Can a leaked binary object be garbage collected? Can it be moved around? Is it safe to access the binary object concurrently? Does Oak own a cached representation of that binary object that might be invalidated by the client? These are, in my opinion, more specific instances of the same question: does the correctness of Oak depend on the behaviour of its users?

Regarding UC1 and UC2, I suppose that the client has some special binary objects that need to be treated in a special way. In UC1 the special binary objects are uploaded images that participate in a more complex workflow of conversion and rendition generation. In UC2 the special binary objects are files that the client's organization wants to make accessible to a geographically distributed team by leveraging S3's infrastructure.

In my opinion, the solution to UC1 and UC2 is for the clients to recognize that those "special binary objects" are so special that they deserve special treatment on the client's side. Oak can be used to store references to those binary objects, but not the binary objects themselves. Similar considerations can be applied to UC3, UC5, UC6 and UC7 too.

In UC4 you cite the zero copy support of Jetty and the design of Kafka as good examples of efficiency. The comparison holds only up to a point, though. Jetty and Kafka manage both endpoints of the stream: they own both the files that have to be streamed and the socket to stream the files through. Oak, instead, is bounded on top by the JCR specification, which acts as an intermediary between Oak and its users.
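To make the UC4 discussion concrete, here is a minimal sketch of the zero copy technique in plain Java NIO. It is not taken from Jetty, Kafka, Sling or Oak; the class name ZeroCopySpool and the method spool are hypothetical names chosen for illustration. The sketch assumes the caller owns the file and a blocking target channel, which is exactly the condition discussed above.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
public class ZeroCopySpool {

    // Spools the whole file to the target channel and returns the number
    // of bytes transferred. Assumes a blocking target channel.
    public static long spool(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop.
                // When the target is a socket channel, the JVM can delegate
                // the copy to the operating system and avoid user-space buffers.
                long sent = source.transferTo(position, size - position, target);
                if (sent <= 0) {
                    break; // nothing more could be transferred
                }
                position += sent;
            }
            return position;
        }
    }
}
```

Note that the method needs a real file path and a real channel. Neither of them is available through the JCR Binary abstraction, which is why this kind of code fits more naturally in the layer that owns both endpoints.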
A solution for UC4 that can be implemented today would involve Sling serving static files directly using a zero copy approach. The path of those files can be saved in Oak, of course. Similar considerations can be applied to UC8 too.

While I would like to see these problems solved, I still don't believe that Oak is the right place to implement these solutions.

2016-06-01 9:30 GMT+02:00 Chetan Mehrotra <[email protected]>:

> Hi Team,
>
> Recently we had a discussion around a new API proposal for binary access
> [1]. From the discussion it was determined that we should first have a
> collection of the kind of usecases which cannot be easily met by current
> JCR Binary support in Oak so as to get better understanding of various
> requirements. That would help us in coming up with a proper solution to
> enable such usecases going forward
>
> To move forward on that I have tried to collect the various usecases at [2]
> which I have seen in the past.
>
> UC1 - processing a binary in JCR with a native library that only has access
> to the file system
> UC2 - Efficient replication across regions in S3
> UC3 - Text Extraction without temporary File with Tika
> UC4 - Spooling the binary content to socket output via NIO
> UC5 - Transferring the file to FileDataStore with minimal overhead
> UC6 - S3 import
> UC7 - Random write access in binaries
> UC8 - X-SendFile
>
>
> I would like to get teams feedback on the various usecases and then come up
> with the list of usecases which we would like to properly support in Oak.
>
> Once that is determined we can discuss the possible solutions and decide on
> how it gets finally implemented.
>
> Kindly provide your feedback!
>
> Chetan Mehrotra
> [1] http://markmail.org/thread/6mq4je75p64c5nyn
> [2] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
>
