To highlight: as mentioned earlier, the user of the proposed API ties itself to implementation details of Oak, and if those change later then that code would also need to change. Or, as Ian summed it up:
> if the API is introduced it should create an out of band agreement with the consumers of the API to act responsibly.

The method is to be used for those important cases where you do rely on implementation details to get optimal performance in very specific scenarios. It's like DocumentNodeStore making use of some Mongo-specific API to perform an important, critical operation with better performance, after checking whether the underlying DocumentStore is Mongo based.

I have seen the discussion in JCR-3534 and other related issues, but I still do not see any conclusion on how to answer such queries where direct access to blobs is required for performance reasons. This issue is not about exposing the blob reference for remote access but about an optimal path for in-VM access.

> who owns the resource? Who coordinates (concurrent) access to it and how? What are the correctness and performance implications here (races, deadlock, corruptions, JCR semantics)?

The client code would need to be implemented properly. It's much like implementing a CommitHook: if implemented incorrectly it would cause issues, deadlocks, etc. But then we assume that anyone implementing that interface takes proper care in the implementation.

> it limits implementation freedom and hinders further evolution (chunking, de-duplication, content based addressing, compression, gc, etc.) for data stores.

As mentioned earlier, some parts of an API indicate a closer dependency on how things work (like an SPI, or a ConsumerType API in OSGi terms). By using such an API, client code definitely ties itself to Oak implementation details, but that should not limit how Oak's implementation evolves. When it changes, client code needs to adapt accordingly. Oak can express that by incrementing the minor version of the exported package to indicate a change in behavior.

> bypassing JCR's security model

I do not yet see the attack vector which we need to defend against differently here.
Again, the blob URL is not being exposed as part of, say, WebDAV or any other remote call. So I would like to understand the security concern better here (unless it is defending against malicious or badly implemented client code, which we discussed above).

> Can't we come up with an API that allows the blobs to stay under control of Oak?

The code needs to work at the OS level, say with a file handle, or with an S3 object. So I do not see a way it can work without having access to those details.

FWIW there is code out there which reverse engineers the blobId to access the actual binary. People do it to get decent throughput in image rendition logic for large-scale deployments. The proposal here was to formalize that approach by providing a proper API. If we do not provide such an API, then the only way for them would be to continue relying on reverse engineering the blobId!

> If not, this is probably an indication that those blobs shouldn't go into Oak but just references to it as Francesco already proposed. Anything else is whether fish nor fowl: you can't have the JCR goodies but at the same time access underlying resources at will.

That's a fine argument to make. But users here have a real problem to solve, which we should not ignore. Oak-based systems are being proposed for large asset deployments where one of the primary requirements is handling and processing hundreds of TB of binary data. So we would then have to recommend that such cases not use the JCR Binary abstraction and manage the binaries on their own. That would solve both problems (though it might break lots of tooling built on top of the JCR API to manage those binaries)!

Thinking more: another approach I can suggest is that people implement their own BlobStore (maybe by extending ours) and provide this API there, i.e. one which takes a blob id and provides the required details. This way we "outsource" the problem. Would that be acceptable?
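
A minimal sketch of what such an "outsourced" extension might look like, under stated assumptions: the interface `BlobLocationProvider`, the class `CustomFileBlobStore`, and the two-level directory layout are all hypothetical names and choices made for illustration. None of them are Oak APIs, and the layout is not claimed to match any Oak data store. The sketch uses only the JDK so it stays self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical extension point owned by the deployment, not by Oak. */
interface BlobLocationProvider {
    /** Returns the on-disk location for a blob id, if this store has one. */
    Optional<Path> locate(String blobId);
}

/**
 * Hypothetical file-backed store. The root/ab/cd/abcdef12 layout is an
 * assumption for illustration only.
 */
final class CustomFileBlobStore implements BlobLocationProvider {
    private final Path root;

    CustomFileBlobStore(Path root) {
        this.root = root;
    }

    /** Maps a lowercase hex blob id to its assumed path under the root. */
    Path pathFor(String blobId) {
        if (blobId == null || !blobId.matches("[0-9a-f]{4,}")) {
            throw new IllegalArgumentException("unexpected blob id: " + blobId);
        }
        return root.resolve(blobId.substring(0, 2))
                   .resolve(blobId.substring(2, 4))
                   .resolve(blobId);
    }

    /** Stores bytes under a caller-supplied id (simplified: no digest is computed). */
    void put(String blobId, byte[] data) throws IOException {
        Path target = pathFor(blobId);
        Files.createDirectories(target.getParent());
        Files.write(target, data);
    }

    @Override
    public Optional<Path> locate(String blobId) {
        Path p = pathFor(blobId);
        return Files.isRegularFile(p) ? Optional.of(p) : Optional.empty();
    }
}

class BlobLocationDemo {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("blobstore");
        CustomFileBlobStore store = new CustomFileBlobStore(root);
        store.put("abcdef12", new byte[] {1, 2, 3});
        // Client code asks the deployment's own store for the location.
        System.out.println(store.locate("abcdef12").isPresent());
        System.out.println(store.locate("0000ffff").isPresent());
    }
}
```

With this shape the knowledge of the storage layout lives in the custom store, so client code no longer has to reverse engineer the blob id; whether that trade-off is acceptable is exactly the question above.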
Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig <[email protected]> wrote:
>
> Hi,
>
> I very much share Francesco's concerns here. Unconditionally exposing
> access to operation system resources underlying Oak's inner working is
> troublesome for various reasons:
>
> - who owns the resource? Who coordinates (concurrent) access to it and
> how? What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?
>
> - it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.
>
> - bypassing JCR's security model
>
> Pretty much all of this has been discussed in the scope of
> https://issues.apache.org/jira/browse/JCR-3534 and
> https://issues.apache.org/jira/browse/OAK-834. So I suggest to review
> those discussions before we jump to conclusion.
>
> Also what is the use case requiring such a vast API surface? Can't we come
> up with an API that allows the blobs to stay under control of Oak? If not,
> this is probably an indication that those blobs shouldn't go into Oak but
> just references to it as Francesco already proposed. Anything else is
> whether fish nor fowl: you can't have the JCR goodies but at the same time
> access underlying resources at will.
>
> Michael
>
> On 5.5.16 11:00 , Francesco Mari wrote:
>
>> This proposal introduces a huge leak of abstractions and has deep security
>> implications.
>>
>> I guess that the reason for this proposal is that some users of Oak would
>> like to perform some operations on binaries in a more performant way by
>> leveraging the way those binaries are stored. If this is the case, I
>> suggest those users to evaluate an applicative solution implemented on top
>> of the JCR API.
>>
>> If a user needs to store some important binary data (files, images, etc.)
>> in an S3 bucket or on the file system for performance reasons, this
>> shouldn't affect how Oak handles blobs internally. If some assets are of
>> special interest for the user, then the user should bypass Oak and take
>> care of the storage of those assets directly. Oak can be used to store
>> *references* to those assets, that can be used in user code to manipulate
>> the assets in his own business logic.
>>
>> If the scenario I outlined is not what inspired this proposal, I would like
>> to know more about the reasons why this proposal was brought up. Which
>> problems are we going to solve with this API? Is there a more concrete use
>> case that we can use as a driving example?
>>
>> 2016-05-05 10:06 GMT+02:00 Davide Giannella <[email protected]>:
>>
>> On 04/05/2016 17:37, Ian Boston wrote:
>>>
>>>> Hi,
>>>> If the File or URL is writable, will writing to the location cause issues
>>>> for Oak ?
>>>> IIRC some Oak DS implementations use a digest of the content to determine
>>>> the location in the DS, so changing the content via Oak will change the
>>>> location, but changing the content via the File or URL wont. If I didn't
>>>> remember correctly, then ignore the concern. Fully supportive of the
>>>> approach, as a consumer of Oak. The locations will certainly probably leak
>>>> outside the context of an Oak session so the API contract should make it
>>>> clear that the code using a direct location needs to behave responsibly.
>>>>
>>> It's a reasonable concern and I'm not in the details of the
>>> implementation. It's worth to keep in mind though and remember if we
>>> want to adapt to URL or File that maybe we'll have to come up with some
>>> sort of read-only version of such.
>>>
>>> For the File class, IIRC, we could force/use the setReadOnly(),
>>> setWritable() methods. I remember those to be quite expensive in time
>>> though.
>>>
>>> Davide
