Hi,
>>I still don't believe that Oak is the right place to implement these
>>solutions.
>
>What would be the right place then? The Oak user can store the path of the
>file as a string, but he would lose some features (garbage collection for
>example).
I am in total agreement with Thomas. In my mind the use cases Chetan has
outlined are clearly repository use cases. Delegating them to an upper layer
has severe negative consequences for applications that use Sling and Oak in
conjunction. Beyond the garbage collection already mentioned, storing
references only would mean:
* full-text extraction is broken for these binaries
* higher-level manipulation of these binaries would have to be performed
entirely outside of the Sling/Oak stack. Think about XMP write-back (or
extraction of XMP metadata in the first place)
* versioning would be broken unless Oak recognized special semantics in the
reference property. However, that would make the reference property
identical to the data store concept (which also stores a reference to an
external binary) - effectively duplicating the data store concept.
In order to address these use cases I propose to add a new “OakBinary” type
that extends the JCR Binary interface and on which we could implement the
needed methods.
Potentially, it would also be required to add a new high-level concept
“OakDatastore” exposed towards applications.
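To make the proposal concrete, here is a minimal sketch of what such types could look like. All names (OakBinary, OakDatastore, writeTo, addStream, getContentHash) are proposals from this thread, not existing Oak API, and the JCR Binary interface is stubbed with a minimal stand-in so the sketch is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class OakBinarySketch {

    // Stand-in for javax.jcr.Binary (only the relevant subset).
    interface Binary {
        InputStream getStream() throws IOException;
        long getSize();
    }

    // Proposed Oak-specific binary with the operations discussed in the thread.
    interface OakBinary extends Binary {
        void writeTo(WritableByteChannel target) throws IOException; // UC4
        String getContentHash();                                     // UC5: e.g. SHA-1, if available
    }

    // Proposed application-facing data store handle (UC5/UC6).
    interface OakDatastore {
        OakBinary addStream(InputStream in) throws IOException;
    }

    // Minimal in-memory implementation, for illustration only.
    static class MemoryDatastore implements OakDatastore {
        @Override
        public OakBinary addStream(InputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            in.transferTo(buf);
            byte[] data = buf.toByteArray();
            return new OakBinary() {
                public InputStream getStream() { return new ByteArrayInputStream(data); }
                public long getSize() { return data.length; }
                public void writeTo(WritableByteChannel target) throws IOException {
                    target.write(ByteBuffer.wrap(data));
                }
                public String getContentHash() {
                    try {
                        StringBuilder sb = new StringBuilder();
                        for (byte b : MessageDigest.getInstance("SHA-1").digest(data))
                            sb.append(String.format("%02x", b));
                        return sb.toString();
                    } catch (Exception e) { throw new IllegalStateException(e); }
                }
            };
        }
    }

    // Demo: round-trip a string through addStream and writeTo.
    static String roundTrip(String s) {
        try {
            OakBinary b = new MemoryDatastore()
                    .addStream(new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8)));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            b.writeTo(Channels.newChannel(out));
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

The in-memory implementation is only there to show the shape of the contract; a real implementation would be backed by the data store.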
Going through the UC list:
UC1 - processing a binary in JCR with a native library that only has access
to the file system
IIUC this has 2 parts:
1. Direct (non-copy) access to the binary within the DS. Various proposals for
this have been mentioned on the thread “API proposal for - Expose URL for Blob
source”
2. Assuming that the native process has persisted its output in the DS directly,
a node still needs to be created that references that new binary. I am not sure
whether a new method is needed for that part.
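The second part could be sketched roughly as follows. This is an invented, self-contained model (SketchDataStore, putDirect, createFromReference are all hypothetical names); in a real setup, something like Jackrabbit's ReferenceBinary would play the resolving role:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ReferenceSketch {

    // Content-addressed store: the reference is the hex SHA-1 of the content.
    static class SketchDataStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        // Simulates the native process persisting its output directly into the DS.
        String putDirect(byte[] data) {
            String ref = sha1Hex(data);
            blobs.put(ref, data);
            return ref;
        }

        // What the repository-side method would do: resolve a reference to
        // existing content instead of streaming the bytes through JCR again.
        byte[] createFromReference(String ref) {
            byte[] data = blobs.get(ref);
            if (data == null) throw new IllegalArgumentException("unknown reference: " + ref);
            return data;
        }

        static String sha1Hex(byte[] data) {
            try {
                StringBuilder sb = new StringBuilder();
                for (byte b : MessageDigest.getInstance("SHA-1").digest(data))
                    sb.append(String.format("%02x", b));
                return sb.toString();
            } catch (Exception e) { throw new IllegalStateException(e); }
        }
    }

    // Demo: write out of band, then attach by reference only.
    static String demo() {
        SketchDataStore ds = new SketchDataStore();
        String ref = ds.putDirect("rendered output".getBytes(StandardCharsets.UTF_8));
        byte[] attached = ds.createFromReference(ref);
        return new String(attached, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```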
>UC2: already supported using references
This is only half true:
For the use case to work, one would need the possibility to move the binary
via S3 copy (i.e. copying the binary to the receiving S3 bucket and then
passing the reference).
I believe this comes down to the “API proposal for - Expose URL for Blob
source” thread again.
>UC3: could be implemented with "fast random access reads" and changes in
>Tika.
IIUC you suggest adding random (read) access to binaries - right? I think
that would be very useful.
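To illustrate what such an addition could look like: the method name readRange below is invented, and the in-memory stand-in only models the contract - a real implementation would issue a ranged read against the file system or an S3 "Range" GET, so a parser like Tika could fetch a small slice of a container format without streaming the whole binary:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RandomReadSketch {

    // Hypothetical addition to an Oak binary: positional, bounded reads (UC3).
    interface RandomAccessBinary {
        byte[] readRange(long offset, int length);
        long length();
    }

    // In-memory stand-in, for illustration only.
    static RandomAccessBinary inMemory(byte[] data) {
        return new RandomAccessBinary() {
            public byte[] readRange(long offset, int length) {
                int off = Math.toIntExact(offset);
                return Arrays.copyOfRange(data, off, off + length);
            }
            public long length() { return data.length; }
        };
    }

    // Demo: read 4 bytes from the middle of a "large" binary.
    static String demo() {
        byte[] blob = "....head....moov....tail....".getBytes(StandardCharsets.UTF_8);
        RandomAccessBinary bin = inMemory(blob);
        return new String(bin.readRange(12, 4), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```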
>
>UC4: could we add a method "writeTo(WritableByteChannel target)"?
+1
>
>UC5: The SHA-1 hash could be exposed if available, I don't see why not.
>Plus maybe UC1 or UC4.
Would this be sufficient to address UC5?
It sounds to me as if direct write access to the data store from the
application level is needed
(i.e. exposing the DS to the application level and providing a method like addStream(…)).
>
>
>UC6: sounds like UC5
>
>UC7: we would need details (how many writes, do we need a new identifier
>for each write operation,...). Can be implemented quite efficiently for
>the BlobStore implementations (MongoBlobStore / RDBBlobStore /
>FileBlobStore).
Not sure why you need details on “how many writes”. But to give another example
(other than the one mentioned in the wiki page):
Consider a large file in the repository (say a video) and an application writing
XMP metadata into that file (i.e. modifying only a very small portion of it).
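The XMP example could be sketched as a partial-write operation. The name writeRange is invented; this model simply copies, but a chunked BlobStore implementation (MongoBlobStore / RDBBlobStore / FileBlobStore) could share all unmodified chunks and hand back a new identifier for the result:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialWriteSketch {

    // Returns a *new* binary with [offset, offset + patch.length) replaced;
    // the original is untouched, matching immutable data store semantics.
    static byte[] writeRange(byte[] original, long offset, byte[] patch) {
        byte[] copy = Arrays.copyOf(original, original.length);
        System.arraycopy(patch, 0, copy, Math.toIntExact(offset), patch.length);
        return copy;
    }

    // Demo: rewrite a small "XMP packet" inside a larger binary.
    static String demo() {
        byte[] video = "HEADER[xmp:old]PAYLOAD".getBytes(StandardCharsets.UTF_8);
        byte[] updated = writeRange(video, 6, "[xmp:new]".getBytes(StandardCharsets.UTF_8));
        return new String(updated, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```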
Cheers
Michael
>
>
>>As soon as a file path, a file
>>descriptor or an S3 object ID traverses the boundary between Oak and its
>>clients, all bets are off.
>
>Well, we would need to define the exact contract, and maybe access rights.
>
>> is the correctness of Oak depending on the behaviour of the user?
>
>To some extent, this is already the case.
>
>Regards,
>Thomas
>