Re: API proposal for - Expose URL for Blob source (OAK-1963)

Chetan Mehrotra Mon, 09 May 2016 02:44:04 -0700

To highlight: as mentioned earlier, a user of the proposed API is tying
itself to implementation details of Oak, and if those details change later
the calling code would also need to change. Or, as Ian summed it up:

> if the API is introduced it should create an out of band agreement with
> the consumers of the API to act responsibly.

The method is meant for those important cases where you do rely on an
implementation detail to get optimal performance in very specific
scenarios. It's like DocumentNodeStore checking whether the underlying
DocumentStore is Mongo based and, if so, using a Mongo-specific API to
perform some critical operation with better performance.
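
A minimal sketch of that pattern, with purely hypothetical types
(KeyValueStore, BulkCapableStore and StoreReader are illustration names,
not Oak classes): the caller checks the concrete backend and takes a
backend-specific path when one is available, falling back to the
abstract contract otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a storage abstraction; not Oak's DocumentStore.
interface KeyValueStore {
    String find(String key);
}

// Generic backend that only offers the abstract contract.
final class GenericStore implements KeyValueStore {
    @Override
    public String find(String key) {
        return "value-of-" + key;
    }
}

// Backend that also offers a bulk lookup outside the abstract contract.
final class BulkCapableStore implements KeyValueStore {
    @Override
    public String find(String key) {
        return "value-of-" + key;
    }

    // Backend-specific call; callers must know the concrete type to use it.
    List<String> findAll(List<String> keys) {
        List<String> values = new ArrayList<>();
        for (String key : keys) {
            values.add(find(key));
        }
        return values;
    }
}

final class StoreReader {
    // Uses the backend-specific bulk call when the store supports it,
    // otherwise reads one key at a time through the generic interface.
    static List<String> readAll(KeyValueStore store, List<String> keys) {
        if (store instanceof BulkCapableStore) {
            return ((BulkCapableStore) store).findAll(keys);
        }
        List<String> values = new ArrayList<>();
        for (String key : keys) {
            values.add(store.find(key));
        }
        return values;
    }
}
```

The caller is coupled to BulkCapableStore only on the fast path; the
generic path keeps working if that backend is replaced.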

I have seen the discussion on JCR-3534 and other related issues but still do
not see any conclusion on how to address such requirements, where direct
access to blobs is needed for performance reasons. This issue is not about
exposing the blob reference for remote access but about an optimal path for
in-VM access.

> who owns the resource? Who coordinates (concurrent) access to it and how?
> What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?

The client code would need to be implemented properly. It's much like
implementing a CommitHook: if implemented incorrectly it would cause
issues such as deadlocks. But then we assume that anyone implementing that
interface would take proper care in the implementation.

> it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.

As mentioned earlier, some parts of an API indicate a closer dependency on
how things work (like an SPI, or a ConsumerType API in OSGi terms). By using
such an API, client code definitely ties itself to Oak implementation
details, but that should not limit how those details evolve. When they
change, client code needs to adapt accordingly. Oak can express that by
incrementing the minor version of the exported package to indicate a change
in behavior.

> bypassing JCR's security model

I do not yet see the attack vector which we need to defend against
differently here. Again, the blob URL is not being exposed as part of, say,
WebDAV or any other remote call. So I would like to understand the security
concern better here (unless it is defending against malicious or badly
implemented client code, which we discussed above).

> Can't we come up with an API that allows the blobs to stay under control
> of Oak?

The code needs to work either at the OS level, say with a file handle, or
with, say, an S3 object. So I do not see a way it can work without having
access to those details.

FWIW there is code out there which reverse engineers the blobId to access
the actual binary. People do this to get decent throughput in image
rendition logic for large-scale deployments. The proposal here was to
formalize that approach by providing a proper API. If we do not provide
such an API, the only way for them is to continue relying on reverse
engineering the blobId!
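
To illustrate what such reverse engineering can look like, here is a
rough sketch. Everything in it is an assumption made for illustration and
is not specified in this thread: the blob id shape
"<hex digest>#<length>", the three-level directory layout derived from
the first six characters of the digest, and the class name BlobIdPaths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class BlobIdPaths {

    // Assumption: the blob id looks like "<hex digest>#<length>";
    // the part before '#' identifies the stored content.
    static String contentId(String blobId) {
        int hash = blobId.indexOf('#');
        return hash < 0 ? blobId : blobId.substring(0, hash);
    }

    // Assumption: files live under root/ab/cd/ef/abcdef..., where the
    // directory names are the first three character pairs of the digest.
    static Path resolve(Path root, String blobId) {
        String id = contentId(blobId);
        if (id.length() < 6) {
            throw new IllegalArgumentException("content id too short: " + id);
        }
        return root
                .resolve(id.substring(0, 2))
                .resolve(id.substring(2, 4))
                .resolve(id.substring(4, 6))
                .resolve(id);
    }
}
```

Code like this silently breaks as soon as the id format or the on-disk
layout changes, which is the risk a formal API would make explicit.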

> If not, this is probably an indication that those blobs shouldn't go into
> Oak but just references to it as Francesco already proposed. Anything else
> is neither fish nor fowl: you can't have the JCR goodies but at the same
> time access underlying resources at will.

That's a fine argument to make. But users here have a real problem to
solve, which we should not ignore. Oak-based systems are being proposed for
large asset deployments where one of the primary requirements is
handling/processing hundreds of TB of binary data. So we would then have to
recommend that such cases not use the JCR Binary abstraction and manage the
binaries on their own. That would then solve both problems (though it might
break lots of tooling built on top of the JCR API to manage those
binaries)!

Thinking more: another approach I can suggest is that people implement
their own BlobStore (maybe by extending ours) and provide this API there,
i.e. one which takes a blob id and provides the required details.
This way we "outsource" the problem. Would that be acceptable?
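
A rough sketch of what that could look like from the user's side. The
names BlobLocationResolver and InMemoryLocatingBlobStore are hypothetical
and not part of Oak; a real implementation would extend an actual
BlobStore rather than the in-memory stand-in used here.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical user-owned API: maps a blob id to a backend location.
interface BlobLocationResolver {
    Optional<URI> resolveLocation(String blobId);
}

// In-memory stand-in for a user-maintained blob store extension.
final class InMemoryLocatingBlobStore implements BlobLocationResolver {
    private final Map<String, URI> locations = new HashMap<>();

    // Records where the binary for a blob id is physically stored.
    void register(String blobId, URI location) {
        locations.put(blobId, location);
    }

    // Returns the location if known, or an empty Optional otherwise.
    @Override
    public Optional<URI> resolveLocation(String blobId) {
        return Optional.ofNullable(locations.get(blobId));
    }
}
```

With this split, the code that knows the storage layout lives with the
people who depend on it, and Oak itself exposes nothing new.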

Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig <[email protected]> wrote:

>
> Hi,
>
> I very much share Francesco's concerns here. Unconditionally exposing
> access to operating system resources underlying Oak's inner workings is
> troublesome for various reasons:
>
> - who owns the resource? Who coordinates (concurrent) access to it and
> how? What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?
>
> - it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.
>
> - bypassing JCR's security model
>
> Pretty much all of this has been discussed in the scope of
> https://issues.apache.org/jira/browse/JCR-3534 and
> https://issues.apache.org/jira/browse/OAK-834. So I suggest reviewing
> those discussions before we jump to conclusions.
>
>
> Also what is the use case requiring such a vast API surface? Can't we come
> up with an API that allows the blobs to stay under control of Oak? If not,
> this is probably an indication that those blobs shouldn't go into Oak but
> just references to it as Francesco already proposed. Anything else is
> neither fish nor fowl: you can't have the JCR goodies but at the same time
> access underlying resources at will.
>
> Michael
>
>
>
>
> On 5.5.16 11:00 , Francesco Mari wrote:
>
>> This proposal introduces a huge leak of abstractions and has deep security
>> implications.
>>
>> I guess that the reason for this proposal is that some users of Oak would
>> like to perform some operations on binaries in a more performant way by
>> leveraging the way those binaries are stored. If this is the case, I
>> suggest those users to evaluate an applicative solution implemented on top
>> of the JCR API.
>>
>> If a user needs to store some important binary data (files, images, etc.)
>> in an S3 bucket or on the file system for performance reasons, this
>> shouldn't affect how Oak handles blobs internally. If some assets are of
>> special interest for the user, then the user should bypass Oak and take
>> care of the storage of those assets directly. Oak can be used to store
>> *references* to those assets, that can be used in user code to manipulate
>> the assets in his own business logic.
>>
>> If the scenario I outlined is not what inspired this proposal, I would
>> like to know more about the reasons why this proposal was brought up. Which
>> problems are we going to solve with this API? Is there a more concrete use
>> case that we can use as a driving example?
>>
>> 2016-05-05 10:06 GMT+02:00 Davide Giannella <[email protected]>:
>>
>> On 04/05/2016 17:37, Ian Boston wrote:
>>>
>>>> Hi,
>>>> If the File or URL is writable, will writing to the location cause
>>>> issues for Oak?
>>>> IIRC some Oak DS implementations use a digest of the content to
>>>> determine the location in the DS, so changing the content via Oak will
>>>> change the location, but changing the content via the File or URL won't.
>>>> If I didn't remember correctly, then ignore the concern. Fully supportive
>>>> of the approach, as a consumer of Oak. The locations will certainly
>>>> probably leak outside the context of an Oak session so the API contract
>>>> should make it clear that the code using a direct location needs to
>>>> behave responsibly.
>>>>
>>>>
>>> It's a reasonable concern and I'm not in the details of the
>>> implementation. It's worth to keep in mind though and remember if we
>>> want to adapt to URL or File that maybe we'll have to come up with some
>>> sort of read-only version of such.
>>>
>>> For the File class, IIRC, we could force/use the setReadOnly(),
>>> setWritable() methods. I remember those to be quite expensive in time
>>> though.
>>>
>>> Davide
>>>
>>>
>>>
>>>
>>
