Re: [Dspace-tech] Reusing bitstream sequence number

Richard Rodgers Sat, 16 Aug 2008 07:00:11 -0700

Hi Mark:

Let me explain the problem more fully, which is a very simple
'inconvenient truth' about assets: some complex digital objects we
want to submit as one Item have filename duplications.

E.g. in directory 'q4' we have 'report.doc', but the same filename in
directory 'fy08' with different content. In the face of this, we can:


(1) reject the content ("duplicate filenames detected! - please correct
or resubmit as multiple items"), which is unacceptable.

(2) accept the content, but transform or rewrite into unique filenames
(q4-report.doc? report[2].doc?, etc?), which is almost as bad, since we
now have both obscured the original name, and altered what we are
supposed to be preserving.

or (3) [what DSpace currently does] store the filename as *metadata*,
which, like file size, can be valuable, but which may not be unique,
and use a different identification system that *guarantees* uniqueness
within the item (sequence id).
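
To make option (3) concrete, here is a minimal sketch in Python. It is
not DSpace's actual code: the class and method names (Bitstream, Item,
add_bitstream, find_by_name) are made up for illustration. It only shows
the filename carried as metadata while a per-item sequence id does the
identifying.

```python
from dataclasses import dataclass, field


@dataclass
class Bitstream:
    sequence_id: int  # unique within the item
    name: str         # original filename, kept as plain metadata
    content: bytes


@dataclass
class Item:
    bitstreams: dict = field(default_factory=dict)
    next_sequence_id: int = 1

    def add_bitstream(self, name, content):
        # Hypothetical helper: hand out the next unused sequence id.
        bs = Bitstream(self.next_sequence_id, name, content)
        self.bitstreams[bs.sequence_id] = bs
        self.next_sequence_id += 1
        return bs

    def find_by_name(self, name):
        # May return zero, one, or many bitstreams: names are not unique.
        return [b for b in self.bitstreams.values() if b.name == name]


item = Item()
q4 = item.add_bitstream("report.doc", b"q4 content")
fy08 = item.add_bitstream("report.doc", b"fy08 content")
```

Both files keep the name 'report.doc' untouched, yet each can still be
fetched unambiguously by its sequence id.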

I think because it's a number, the sequence ID is easily confused with a
version, which it is not. And in fact, there is nothing sacred about
sequence numbers as a technique either: we also considered MD5
checksums, timestamps (maybe now UUIDs, etc.); sequence numbers
won because the URLs were shorter and easier to use.

The choice of ID schemes does have consequences, as some of John P.'s
use-cases illustrate: a 'slot number' (which can be reassigned) is
different from a 'sequence number' (which can't), and we can debate the
comparative merits of each (or others): my point was that filename is an
apparent non-starter (for reasons above).
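
The slot/sequence distinction can be sketched roughly as follows. This
is a Python illustration only: both classes are hypothetical and are not
taken from the DSpace code base.

```python
class SequenceNumbers:
    """Ids only ever grow; a deleted id is never handed out again."""

    def __init__(self):
        self.next_id = 1
        self.live = {}

    def add(self, payload):
        seq = self.next_id
        self.next_id += 1
        self.live[seq] = payload
        return seq

    def delete(self, seq):
        del self.live[seq]  # the number is retired, not recycled


class SlotNumbers:
    """A slot is a position that can be emptied and filled again."""

    def __init__(self):
        self.slots = {}

    def put(self, slot, payload):
        self.slots[slot] = payload  # overwriting a slot is allowed


seqs = SequenceNumbers()
first = seqs.add("report v1")
seqs.delete(first)
second = seqs.add("report v2")  # gets a fresh number

slots = SlotNumbers()
slots.put(1, "report v1")
slots.put(1, "report v2")       # same number, new content
```

A reference to a sequence number always means the same content (or
nothing, once it is deleted); a reference to a slot means whatever
currently occupies it.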


As to the 'heuristic' URLs in 1.5 Manakin, I regard them as closer to
a bug than a solution. Just as we would never use an online bank that
looked up our account files by taking the first match for our last
names, so I think we should not accept indeterminate semantics in
bitstream retrieval (I wanted 'fy08', but got 'q4') - that's what unique
IDs are for.


My 2 cents,

Richard




Mark Diggory wrote:

On Aug 15, 2008, at 12:15 PM, John Preston wrote:
On Fri, Aug 15, 2008 at 1:40 PM, Richard Rodgers <[EMAIL PROTECTED]> wrote:
On Fri, 2008-08-15 at 10:12 -0700, Mark Diggory wrote:
On Aug 15, 2008, at 9:36 AM, John Preston wrote:
Hi. Can anyone say how I can re-use a bitstream sequence number. The
use case is the following....
On Aug 15, 2008, at 10:01 AM, Mark H. Wood wrote:
Allowed or not, this sounds risky.  If you are overloading the
sequence number with a new meaning, this practice is likely to bite
you again and again, since the developing stock code won't recognize
your second meaning and will take no pains to preserve it....
Mark is correct about overloading the semantics here.  Note, we
adjusted the behavior behind the dspace 1.5 XMLUI (but not the JSPUI)
to allow for unsequenced name resolution of the bitstreams. For
instance:
...
It certainly would have been much easier to key Bitstreams on the
name rather than a sequence id in the original architecture.  I've
seen requests such as yours numerous times during my history of
working on DSpace and being able to reference resources by simple
assignable predictable names rather than internally generated
sequence ids makes life on the outside of DSpace easier and 3rd party
tooling more powerful. This is something I hope to take into the 2.0
development initiative.
Easier perhaps, but unfortunately the Bitstream filename need not be
unique, so is a problematic candidate for a durable reference.
Richard, that is the crux of my criticism. It would be easier and
more useful all around if the name were part of the identifier/re-
visioning strategy for the item in DSpace 2.0, using the name as the
identifier for the bitstream within the scope of that Item and its
item-wide revision id. The current XMLUI support is a transition
somewhere between the original DSpace behavior and this Item re-
visioning end-goal of 2.0.
Likewise, John's case is yet another example of why we need the
ability to assign such identifiers rather than have them assigned
internally. And because John seeks to supply an updated version of
the file with the requirement that he not have to remove all the
bitstreams and recreate them in order to reconstruct all the local
references to that specific bitstream within his item, it's a
reasonable use case. I encountered this when creating the DDI
metadata (relative URI) describing the data files I ported from the
Virtual Data Center to DSpace.
http://dspace.mit.edu/handle/1721.1/39118

Where I might have:

http://dspace.mit.edu/bitstream/handle/1721.1/39126/1/study.xml
How would I define my DDI's relative references to the other
bitstreams prior to having ingested the entire package representing
the Item into DSpace, when my external application doesn't have
access to this internally generated sequence id until after the fact?
(that's rhetorical and answered below)
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/3/womenpolicymakers_census_dta.tab
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/2/womenpolicymakers_census.dta
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/5/womenpolicymakers_parta_dta.tab
Rather than the above, reserving the name to be the unique identifier
and eliminating the bitstream sequence id from the path allows me
this flexibility.
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/study.xml?sequence=1
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/womenpolicymakers_census_dta.tab?sequence=3
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/womenpolicymakers_census.dta?sequence=2
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/womenpolicymakers_parta_dta.tab?sequence=5
These can all be relatively referenced easily (without uniqueness
constraints) if the heuristic for resolution is sensible and
predictable. I admit this heuristic is currently poorly defined and
could use adjustment to return the bitstream with the same name and
latest sequence id, thus becoming, in a sense, a "poor man's" re-
visioning system for 1.5.
./study.xml
./womenpolicymakers_census_dta.tab
./womenpolicymakers_census.dta
./womenpolicymakers_parta_dta.tab
And if I wish to retain the granularity of the sequence id as a
revision identifier when referring to the bitstream:
./study.xml?sequence=1
./womenpolicymakers_census_dta.tab?sequence=3
./womenpolicymakers_census.dta?sequence=2
./womenpolicymakers_parta_dta.tab?sequence=5
Because of this "chicken-and-egg" problem that DSpace (pre 1.5 xmlui)
creates, I had to abandon any attempts to capture changes to the
bitstreams (or even the bitstream's initial sequence id) because of
the lack of granularity in the Import/Package Ingest process. The
only way that applications can resolve the above relative URIs is to
have a mechanism that tolerates the usage of a composite identifier,
name[?sequence=revision id], as a unique identifier, with a sane
default when the sequence id is absent, meaning to refer to the
latest.
I don't think this is an unrealistic behavior to want out of the
system. SVN/VIEWVC handles the subject elegantly by returning the
most recent revision of a file
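
A rough sketch of that resolution rule, in Python. This is only an
illustration of the proposal, not the XMLUI's current behavior: the
resolve() function is hypothetical, the sequence ids 1, 2, 3 and 5 come
from the URLs above, and the entry with sequence 6 is an assumed later
replacement of study.xml added to show the default.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical item contents: (sequence id, bitstream name) pairs.
BITSTREAMS = [
    (1, "study.xml"),
    (2, "womenpolicymakers_census.dta"),
    (3, "womenpolicymakers_census_dta.tab"),
    (5, "womenpolicymakers_parta_dta.tab"),
    (6, "study.xml"),  # assumed later replacement of study.xml
]


def resolve(relative_uri, bitstreams=BITSTREAMS):
    """Return the sequence id a relative URI refers to, or None."""
    parts = urlsplit(relative_uri)
    name = parts.path.rsplit("/", 1)[-1]
    wanted = parse_qs(parts.query).get("sequence")
    matches = [seq for seq, n in bitstreams if n == name]
    if not matches:
        return None
    if wanted is None:
        return max(matches)  # no sequence given: the latest one wins
    seq = int(wanted[0])
    return seq if seq in matches else None
```

With this rule "./study.xml" resolves to the newest bitstream of that
name, while "./study.xml?sequence=1" still pins the original one.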
http://dspace.svn.sourceforge.net/viewvc/dspace/branches/dspace-1_5_x/dspace/docs/html/index.html
and allows the various other revisions of the filename, which is unique
to the current revision, to be returned from more complex queries that
can be made against it.
http://dspace.svn.sourceforge.net/viewvc/dspace/branches/dspace-1_5_x/dspace/docs/html/index.html?revision=3044
In fact, this allows a very elegant relative reference solution to
arise that doesn't require recalculation to place relative references
into the system. (And eliminates the need for a special service like
HTMLServlet to resolve these references using searches for matching
paths in the bitstream names. Simply try navigating the above
documentation in the repository.)
How will the versioning scheme, that I recall being talked about some
time ago, work? Did it not need to keep a stable reference to a
bitstream along with versions?

John
Yes, it does intend to, and currently that scheme is outdated in the
architectural review given a number of new considerations with the
usage of UUIDs and referring to resources without nested hierarchies
of identifiers. There was also a bit of recent work that went on in
the Bristol meeting around relying on underlying support for
versioning in the storage layers of the new 2.0 architecture.
However, that's not completely thought out either.
My current viewpoint on the subject is that the versioning
discussion in the architectural review outlined a need to have
versioning be at the Item level only. This meant that revisions would
be referred to via an item revision id rather than by individual
bitstream sequence ids. For instance:
http://host/resource/[Item ID]/[Item_Version_ID]/[Manifestation_ID]/[File_ID]
And for example this might result in something that looks like:

http://host/resource/Item_X/Version_1/Manifestation_Y/study.xml
http://host/resource/Item_X/Version_1/Manifestation_Y/womenpolicymakers_census_dta.tab
http://host/resource/Item_X/Version_1/Manifestation_Y/womenpolicymakers_census.dta
http://host/resource/Item_X/Version_1/Manifestation_Y/womenpolicymakers_parta_dta.ta
http://host/resource/Item_X/Version_2/Manifestation_Y/study.xml
http://host/resource/Item_X/Version_2/Manifestation_Y/womenpolicymakers_census_dta.tab
http://host/resource/Item_X/Version_2/Manifestation_Y/womenpolicymakers_census.dta
http://host/resource/Item_X/Version_2/Manifestation_Y/womenpolicymakers_parta_dta.ta
where I had just replaced "womenpolicymakers_census_dta.tab" and
the other referenced Bitstreams are simply retained and mapped to the
new version id.
This furthers my proposed strategy above by still retaining the
relative reference capabilities within the "critical bitstream
portion" of the path.
We also talked about the following defaulting to the latest
version, not unlike the behavior of SVN/VIEWVC.
http://host/resource/Item_X/Manifestation_Y/study.xml
http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_census_dta.tab
http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_census.dta
http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_parta_dta.ta
Note, if you're confused about what a "Manifestation" is: it represents,
in the DSpace 2.0 model, a replacement for the Bundle that is
properly exposed and aligns with the Manifestation conceptualized in
the FRBR area of research.
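
As a rough illustration of item-level versioning with a "latest"
default, here is a small Python sketch. It is not a 2.0 design: the
functions new_version() and resolve() and the "blob-..." content ids are
invented for the example, and the Manifestation segment is accepted in
the path but otherwise ignored.

```python
# Hypothetical model: version number -> {file name: stored content id}.
versions = {
    1: {
        "study.xml": "blob-a",
        "womenpolicymakers_census_dta.tab": "blob-b",
        "womenpolicymakers_census.dta": "blob-c",
    }
}


def new_version(versions, replaced):
    """Create the next version: copy the old mapping, swap in replacements."""
    latest = max(versions)
    mapping = dict(versions[latest])  # untouched files keep their content
    mapping.update(replaced)
    versions[latest + 1] = mapping
    return latest + 1


def resolve(versions, path):
    """Resolve '[Version_N/]Manifestation_Y/name'; no version means latest."""
    segments = path.split("/")
    if segments[0].startswith("Version_"):
        version = int(segments[0][len("Version_"):])
    else:
        version = max(versions)
    name = segments[-1]
    return versions[version][name]


# Replace one file; the others are carried over into Version_2.
new_version(versions, {"womenpolicymakers_census_dta.tab": "blob-d"})
```

Here "Version_1/Manifestation_Y/womenpolicymakers_census_dta.tab" still
yields the old content, the same path without a version segment yields
the replacement, and study.xml resolves to the same content under
either version.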
Cheers,
Mark

~~~~~~~~~~~~~
Mark R. Diggory - DSpace Developer and Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology
Home Page: http://purl.org/net/mdiggory/homepage






-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

