Hi,
>>I still don't believe that Oak is the right place to implement these
>>solutions.
>
>What would be the right place then? The Oak user can store the path of the
>file as a string, but he would lose some features (garbage collection for
>example).
I am in total agreement with Thomas. In my mind the use cases Chetan has
outlined are clearly repository use cases. Delegating them to an upper layer
has severe negative consequences for applications that use Sling and Oak in
conjunction. Beyond the garbage collection already mentioned, storing
references only would mean:
* full-text extraction is broken for these binaries
* higher-level manipulation of these binaries would have to be performed
entirely outside of the Sling/Oak stack. Think about XMP write-back (or
extraction of XMP metadata in the first place)
* versioning would be broken unless Oak recognized special semantics in the
reference property. However, that would make the reference property
identical to the data store concept (which also stores a reference to an
external binary) - effectively duplicating the data store concept.
In order to address these use cases I propose to add a new “OakBinary” type
that extends the JCR Binary interface and on which we could implement the
needed methods.
Potentially, it would also be required to add a new high-level concept
“OakDatastore” exposed towards applications.
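To make the proposal concrete, here is a minimal sketch of what such types could look like. All names (OakBinary, OakDatastore, writeTo, addStream, getContentHash) are proposals from this thread, not existing Oak API, and the JCR Binary interface is stubbed with a minimal stand-in so the sketch is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class OakBinarySketch {

    // Stand-in for javax.jcr.Binary (only the relevant subset).
    interface Binary {
        InputStream getStream() throws IOException;
        long getSize();
    }

    // Proposed Oak-specific binary with the operations discussed in the thread.
    interface OakBinary extends Binary {
        void writeTo(WritableByteChannel target) throws IOException; // UC4
        String getContentHash();                                     // UC5: e.g. SHA-1, if available
    }

    // Proposed application-facing data store handle (UC5/UC6).
    interface OakDatastore {
        OakBinary addStream(InputStream in) throws IOException;
    }

    // Minimal in-memory implementation, for illustration only.
    static class MemoryDatastore implements OakDatastore {
        @Override
        public OakBinary addStream(InputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            in.transferTo(buf);
            byte[] data = buf.toByteArray();
            return new OakBinary() {
                public InputStream getStream() { return new ByteArrayInputStream(data); }
                public long getSize() { return data.length; }
                public void writeTo(WritableByteChannel target) throws IOException {
                    target.write(ByteBuffer.wrap(data));
                }
                public String getContentHash() {
                    try {
                        StringBuilder sb = new StringBuilder();
                        for (byte b : MessageDigest.getInstance("SHA-1").digest(data))
                            sb.append(String.format("%02x", b));
                        return sb.toString();
                    } catch (Exception e) { throw new IllegalStateException(e); }
                }
            };
        }
    }

    // Demo: round-trip a string through addStream and writeTo.
    static String roundTrip(String s) {
        try {
            OakBinary b = new MemoryDatastore()
                    .addStream(new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8)));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            b.writeTo(Channels.newChannel(out));
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

The in-memory implementation is only there to show the shape of the contract; a real implementation would be backed by the data store.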
Going through the UC list:
UC1 - processing a binary in JCR with a native library that only has access
to the file system
IIUC this has 2 parts:
1. Direct (non-copy) access to the binary within the DS. Various proposals for
this have been mentioned on the thread “API proposal for - Expose URL for Blob
source”
2. Assuming that the native process has persisted its output in the DS directly,
a node still needs to be created that references that new binary. I am not sure
whether a new method is needed for that part.
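The second part could be sketched roughly as follows. This is an invented, self-contained model (SketchDataStore, putDirect, createFromReference are all hypothetical names); in a real setup, something like Jackrabbit's ReferenceBinary would play the resolving role:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ReferenceSketch {

    // Content-addressed store: the reference is the hex SHA-1 of the content.
    static class SketchDataStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        // Simulates the native process persisting its output directly into the DS.
        String putDirect(byte[] data) {
            String ref = sha1Hex(data);
            blobs.put(ref, data);
            return ref;
        }

        // What the repository-side method would do: resolve a reference to
        // existing content instead of streaming the bytes through JCR again.
        byte[] createFromReference(String ref) {
            byte[] data = blobs.get(ref);
            if (data == null) throw new IllegalArgumentException("unknown reference: " + ref);
            return data;
        }

        static String sha1Hex(byte[] data) {
            try {
                StringBuilder sb = new StringBuilder();
                for (byte b : MessageDigest.getInstance("SHA-1").digest(data))
                    sb.append(String.format("%02x", b));
                return sb.toString();
            } catch (Exception e) { throw new IllegalStateException(e); }
        }
    }

    // Demo: write out of band, then attach by reference only.
    static String demo() {
        SketchDataStore ds = new SketchDataStore();
        String ref = ds.putDirect("rendered output".getBytes(StandardCharsets.UTF_8));
        byte[] attached = ds.createFromReference(ref);
        return new String(attached, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```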
>UC2: already supported using references
This is only half true:
For the use case to work, one would need the possibility to move the binary
via S3 copy (i.e. copying the binary to the receiving S3 bucket and then
passing the reference).
I believe this comes down to the “API proposal for - Expose URL for Blob
source” thread again.
>UC3: could be implemented with "fast random access reads" and changes in
>Tika.
IIUC you suggest adding random (read) access to binaries - right? I think
that would be very useful.
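To illustrate what such an addition could look like: the method name readRange below is invented, and the in-memory stand-in only models the contract - a real implementation would issue a ranged read against the file system or an S3 "Range" GET, so a parser like Tika could fetch a small slice of a container format without streaming the whole binary:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RandomReadSketch {

    // Hypothetical addition to an Oak binary: positional, bounded reads (UC3).
    interface RandomAccessBinary {
        byte[] readRange(long offset, int length);
        long length();
    }

    // In-memory stand-in, for illustration only.
    static RandomAccessBinary inMemory(byte[] data) {
        return new RandomAccessBinary() {
            public byte[] readRange(long offset, int length) {
                int off = Math.toIntExact(offset);
                return Arrays.copyOfRange(data, off, off + length);
            }
            public long length() { return data.length; }
        };
    }

    // Demo: read 4 bytes from the middle of a "large" binary.
    static String demo() {
        byte[] blob = "....head....moov....tail....".getBytes(StandardCharsets.UTF_8);
        RandomAccessBinary bin = inMemory(blob);
        return new String(bin.readRange(12, 4), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```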
>
>UC4: could we add a method "writeTo(WritableByteChannel target)"?
+1
>
>UC5: The SHA-1 hash could be exposed if available, I don't see why not.
>Plus maybe UC1 or UC4.
Would this be sufficient to address UC5?
It sounds to me as if direct write access to the data store from the
application level is needed
(i.e. exposing the DS to the application level and providing a method like addStream(…)).
>
>
>UC6: sounds like UC5
>
>UC7: we would need details (how many writes, do we need a new identifier
>for each write operation,...). Can be implemented quite efficiently for
>the BlobStore implementations (MongoBlobStore / RDBBlobStore /
>FileBlobStore).
Not sure why you need details on “how many writes”. But to give another example
(other than the one mentioned in the wiki page):
Consider a large file in the repository (say a video) and an application writing
XMP metadata into that file (i.e. modifying only a very small portion of it).
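The XMP example could be sketched as a partial-write operation. The name writeRange is invented; this model simply copies, but a chunked BlobStore implementation (MongoBlobStore / RDBBlobStore / FileBlobStore) could share all unmodified chunks and hand back a new identifier for the result:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialWriteSketch {

    // Returns a *new* binary with [offset, offset + patch.length) replaced;
    // the original is untouched, matching immutable data store semantics.
    static byte[] writeRange(byte[] original, long offset, byte[] patch) {
        byte[] copy = Arrays.copyOf(original, original.length);
        System.arraycopy(patch, 0, copy, Math.toIntExact(offset), patch.length);
        return copy;
    }

    // Demo: rewrite a small "XMP packet" inside a larger binary.
    static String demo() {
        byte[] video = "HEADER[xmp:old]PAYLOAD".getBytes(StandardCharsets.UTF_8);
        byte[] updated = writeRange(video, 6, "[xmp:new]".getBytes(StandardCharsets.UTF_8));
        return new String(updated, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```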
Cheers
Michael
>
>
>>As soon as a file path, a file
>>descriptor or an S3 object ID traverses the boundary between Oak and its
>>clients, all bets are off.
>
>Well, we would need to define the exact contract, and maybe access rights.
>
>> is the correctness of Oak depending on the behaviour of the user?
>
>To some extent, this is already the case.
>
>Regards,
>Thomas
>