Re: [fcrepo-dev] More food for 4.0 thought: fcrepo-store

aj...@virginia.edu Wed, 04 Apr 2012 06:15:48 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Comments in-line.

- ---
A. Soroka
Software & Systems Engineering :: Online Library Environment
the University of Virginia Library

On Mar 30, 2012, at 8:45 AM, Chris Wilper wrote:

> I'm not sure I see how a domain model's accuracy could be degraded by 
> increasing the atomicity of the persistence entities.

It's the multiplication of identities for purposes of workflow that troubles 
me. It may very well be that I'm making a mountain out of a molehill, so I 
welcome comment about this. I'm mildly worried that people will find themselves 
with content modeling that is based both on their actual content and on their 
needs for transactionality, when ideally, notions of transactionality in Fedora 
would be adjustable so that such needs could be addressed after content 
modeling is done. 

> Stepping back for a second, I think it's interesting that the line between 
> datastream and object has become less distinct in recent
> years. I'm not sure that's such a bad thing, and it makes me question the 
> value of continuing to treat datastreams as second-class citizens in the 
> architecture. Are there good pure domain modeling reasons? Information hiding 
> is all I can think of, but that seems like a concern you'd want to layer on 
> top of whatever persistence mechanism you had at your disposal. In other 
> words, you're welcome to think of a certain set of your Fedora objects as 
> private members of some other set of Fedora objects.

Amen! I wasn't going to open that question as part of this discussion, but 
since you've already gone there... {grin} The inception of 4.x seems to me to 
be a great time to step back, look at the Fedora architecture itself (as 
distinct from Fedora Commons as a software project) and its object model, and 
ask ourselves some deep questions about where we want to take it. But perhaps 
we can start this as a separate conversation and cross-reference them as needed?

> Ahh, but in the brave new pluggable world, there is no requirement that you 
> allocate a filesystem inode (for example) for every Fedora object. What if 
> you opted to store each Fedora object in a highly scalable relational 
> database instead?

Good point. Pushing scaling concerns out of the kernel might be a good 
architectural "habit" in general...

> I'd be interested to get others' thoughts on this too, but it seems like if 
> we wanted to accommodate "more-granular-than-object-level"
> atomicity in the design, it would look quite different from what's been 
> discussed so far with HLStorage/fcrepo-store. And so far in the discussion, 
> keeping it at the Fedora object level, we've been able to punt on/ignore some 
> of the details with the FedoraObject design. It'd be good to talk about those 
> in some detail as well, regardless of where we end up on this thread, so I 
> welcome the discussion.

That's true, and I don't want to dig any rabbit holes to go down. But talk is 
cheap, so let's have some. {grin}

> To clarify, by "deep" copy, I mean that it's a full copy of whatever members 
> the original object had, so that changing a value or field
> somewhere in one does not affect the other. The important bit is just that 
> oldObj does not change.

Right. Immutability to the fore! (Incidentally, for those who've been following 
this discussion, Chris' pointer to Rich Hickey's talk is really well worth 
following. It's a great high-level discussion of how (and how not) to deal 
with time in software to build high-concurrency systems and robust systems 
generally.)
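
To make the "oldObj does not change" point concrete, here is a minimal sketch 
in Python (not the Java dto classes). The class and field names below are 
hypothetical stand-ins that only borrow the vocabulary of this thread, and 
with_location() is an invented helper, not part of any proposed API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DatastreamVersion:
    # Hypothetical stand-in: only the fields needed for the illustration.
    ds_id: str
    content_location: str


@dataclass
class FedoraObject:
    # Hypothetical stand-in for the object discussed in the thread.
    pid: str
    datastreams: dict = field(default_factory=dict)


def with_location(old_obj, ds_id, new_location):
    """Return a modified deep copy of old_obj; old_obj itself is never touched."""
    new_obj = copy.deepcopy(old_obj)
    new_obj.datastreams[ds_id].content_location = new_location
    return new_obj


old_obj = FedoraObject(
    "demo:1", {"DS1": DatastreamVersion("DS1", "http://example.org/a")}
)
new_obj = with_location(old_obj, "DS1", "http://example.org/b")
```

Because the copy is deep, the nested DatastreamVersion is duplicated too, so a 
caller still holding old_obj sees exactly the state it started with.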

> Currently in the dto design, FedoraObject instances do not provide access to 
> binary data, even if managed. They just point to it via the 
> DatastreamVersion.contentLocation (URL) accessor.  If that approach were used 
> here (just a pointer and nothing more, as you suggest), let's think about 
> what would happen within the fcrepo-store/HLStore impl in response to the 
> following requests:
> 
> .add(FedoraObject obj):
> Any managed datastream referenced in obj would need to be resolved by the 
> store impl, then stored. Key Question: Is the original reference (dsLocation) 
> retained? Today in Fedora it's in fact not retained; it's changed. Options 
> are: Keep it as-is, change it to pid+dsId+dsVersionId (as is done today), or 
> remove it. Personally I am beginning to think removing it might be the right 
> move, but that fights with my instinct to store the FedoraObject instance 
> "exactly as given"

This doesn't speak exactly to the question, but if datastreams become 
first-class citizens, then there's another possible choice-- replace it with an 
opaque ID that is independent of the object (perhaps do this on the creation of 
the datastream as a Fedora construct, which would return us to "Keep it as-is").
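
A rough sketch of what I mean by an opaque ID, in Python and purely 
illustrative: DatastreamRegistry, create() and locate() are names I am making 
up here, and the urn:uuid form is just one possible choice of opaque 
identifier.

```python
import uuid


class DatastreamRegistry:
    """Hypothetical registry in which datastreams have an identity of their own."""

    def __init__(self):
        self._locations = {}

    def create(self, source_location):
        # The ID is minted when the datastream is created and carries no pid,
        # so it can later be stored in an object "as-is".
        ds_ref = "urn:uuid:" + str(uuid.uuid4())
        self._locations[ds_ref] = source_location
        return ds_ref

    def locate(self, ds_ref):
        return self._locations[ds_ref]


registry = DatastreamRegistry()
ref = registry.create("http://example.org/upload/scan.tif")

# Two objects can point at the same datastream without either one owning it.
obj_a = {"pid": "demo:1", "datastreams": {"IMG": ref}}
obj_b = {"pid": "demo:2", "datastreams": {"MASTER": ref}}
```

The store never has to rewrite the reference on add, because the reference was 
already repository-controlled before the object was assembled.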

> .update(FedoraObject oldObj, FedoraObject newObj):
> For each managed datastream referenced in oldObj:
> - if it's not in newObj, delete it from storage
> - if it's in newObj but with a different location, replace the old content
> - if it's in newObj but with the same location, maybe we still need to 
> replace the old content. How do we know if that's necessary?

If it's _really_ just a URI, then the cost of replacing it (in the object) 
thoughtlessly is low, and perhaps that's a simple but sufficient strategy. The 
cost of reloading it is variable, of course... and the repository can't know at 
ingest time, unless perhaps it's told?
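
The three cases Chris lists, plus the "unless it's told" idea, might be 
sketched like this. It is Python rather than Java, and plan_update() and its 
reload_hints parameter are hypothetical; the dicts simply map a dsId to the 
content location of a managed datastream. Datastreams that appear only in 
newObj are deliberately left out, since the list above only walks oldObj.

```python
def plan_update(old_locations, new_locations, reload_hints=frozenset()):
    """Decide what to do with each managed datastream referenced in the old object.

    old_locations / new_locations: dict of dsId -> content location.
    reload_hints: dsIds the caller says have changed even though the
    location is the same (an assumption, not an existing Fedora feature).
    Returns (to_delete, to_replace, unchanged), each a list of dsIds.
    """
    to_delete, to_replace, unchanged = [], [], []
    for ds_id, old_loc in sorted(old_locations.items()):
        if ds_id not in new_locations:
            to_delete.append(ds_id)       # gone from newObj: delete from storage
        elif new_locations[ds_id] != old_loc:
            to_replace.append(ds_id)      # different location: replace content
        elif ds_id in reload_hints:
            to_replace.append(ds_id)      # same location, but caller told us
        else:
            unchanged.append(ds_id)       # same location, nothing known: keep
    return to_delete, to_replace, unchanged


old = {"DC": "loc:1", "IMG": "loc:2", "TXT": "loc:3", "OLD": "loc:4"}
new = {"DC": "loc:1", "IMG": "loc:9", "TXT": "loc:3"}
plan = plan_update(old, new, reload_hints={"TXT"})
```

Without a hint, the same-location case falls through to "unchanged", which is 
exactly the spot where the repository cannot know on its own.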

> .get(pid)
> In this case, if only a reference (URL) to the content is provided, the 
> caller (client of the store) needs to resolve it to get the
> content.

I don't see any problem with that, especially because the caller might not 
_want_ the content, or not immediately. For example, the caller might not know 
if it wants the content until after examining the object (or even datastream) 
metadata.
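
As a small illustration of that pattern (Python, with a toy in-memory store; 
InMemoryStore, its methods, and the mem:// locations are all invented for the 
example), the caller inspects metadata first and resolves the reference only 
for the content it decides it wants:

```python
class InMemoryStore:
    """Toy store: get() hands back metadata plus a reference, never the bytes."""

    def __init__(self):
        self._objects = {}
        self._blobs = {}
        self.resolve_count = 0  # lets us see how often content was fetched

    def add(self, pid, metadata, content):
        location = "mem://" + pid
        self._blobs[location] = content
        self._objects[pid] = {
            "pid": pid,
            "metadata": dict(metadata),
            "content_location": location,
        }

    def get(self, pid):
        return dict(self._objects[pid])

    def resolve(self, location):
        self.resolve_count += 1
        return self._blobs[location]


store = InMemoryStore()
store.add("demo:1", {"mime": "text/plain"}, b"hello")
store.add("demo:2", {"mime": "image/tiff"}, b"\x00" * 1024)

# Examine metadata first; fetch content only for the objects we care about.
wanted = [
    store.resolve(obj["content_location"])
    for obj in (store.get("demo:1"), store.get("demo:2"))
    if obj["metadata"]["mime"] == "text/plain"
]
```

The large TIFF is never read, which is the whole appeal of handing out a 
pointer instead of the content.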

> A whole 'nother possibility is that FedoraObject instances don't pass managed 
> datastreams in by reference (URL) at all. Instead they're passed in by value 
> (via a getManagedContent() method). In that design, the store impl doesn't 
> need to be responsible for resolving anything on an add request...it just 
> streams the given content to storage.

Ooh, that kind of troubles me. It would mean that the kernel would have to move 
a lot of managed data through quickly, and we'd have to be very sure about its 
ability to do so. Especially in light of the possibility that 
datastreams might be raised to the status of first-class citizens, I think 
perhaps we would want to go _more_ in the direction of handling them by 
reference as much as possible.

- -ajs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJPfElNAAoJEATpPYSyaoIkG+UH/19dXF7AU2MgAG+RnkInDdEV
oVNs0yIvJ7HEXomBzRnEj2SNsNjOI3Sv3l0Usy0XbR+swVD5+6um590Btt3pFfgQ
PBthF1kjo+zAbfcOYFPQHnQTf6vmmVBD/Kbbb7PoJblF4CELcsxSCxRBBpiAB5k/
s6XnD5gU0OVuFtA0uNzLYYUXABsGidQHPKCOgF+FrYZUTaI9KuL5SalFCbMQ5Ujc
WkGnW0ZxLbX8xSlzrPXRqqAq8Dgr53luiD3/+Y8KKOt/QROqYiRqaWKoBKxPvb8E
JHxs28LHvLUbzYkWqEYav3627Fdk0VunLiwoQmHPie5FRhHGKpYS8WN1O0IFfSI=
=70L9
-----END PGP SIGNATURE-----

