Re: [Dspace-tech] Reusing bitstream sequence number

Mark Diggory Sat, 16 Aug 2008 17:42:58 -0700

Richard,

I respectfully disagree with you.


On Aug 16, 2008, at 6:54 AM, Richard Rodgers wrote:

Hi Mark:

Let me explain the problem more fully, which is a very simple
'inconvenient truth' about assets: some complex digital objects we
want to submit as one Item have filename duplications.
E.g. in directory 'q4' we have 'report.doc', but the same filename in
directory 'fy08' with different content. In the face of this, we can:
(1) reject the content ("duplicate filenames detected! - please correct
or resubmit as multiple items"), which is unacceptable.

Is it really that unacceptable?! I disagree: what use are two files with identical names in a DSpace Item? IMHO, it creates ambiguity in an area, the "file name", where users expect conformity with conventions. Really, which file would I choose to download if they had the same name? On top of this, what would I do with the second file when the OS/Browser asked me if I wanted to replace the first one I just downloaded? I suppose I'd have to rename it to arrive back at a state of being able to tell the two apart.

No, instead we should be adopting RESTful practices here, allowing DSpace to adhere to more conventional expectations.

http://en.wikipedia.org/wiki/Representational_State_Transfer#RESTful_example:_the_World_Wide_Web

Here, if DSpace "were" to take on RESTful practices in its URI conventions, we would be able to do things like versioning and predictable resource naming. For instance, in your example:


PUT /bitstream/handle/1234.5/67890/q4/report.doc HTTP/1.1
PUT /bitstream/handle/1234.5/67890/fy08/report.doc HTTP/1.1

Would clearly result in two different bitstreams, whereas if I did do

PUT /bitstream/handle/1234.5/67890/report.doc HTTP/1.1
PUT /bitstream/handle/1234.5/67890/report.doc HTTP/1.1

The second would be overwriting the first. That is also a legitimate behavior, allowing me to replace/version the resource (access to which, if I chose to expose it, might look like the following)...


GET /bitstream/handle/1234.5/67890/report.doc?revision=1 HTTP/1.1

and

GET /bitstream/handle/1234.5/67890/report.doc?revision=0 HTTP/1.1

Likewise, we find this relative directory structure convention maintained in many other Internet-resource-related areas... in fact this is how the SIP METS and OCW IMSCP packaging works, based on basic zip files and manifests. But, yet again, the DSpace solution breaks the convention in this case. Take a METS/SIP package representing the following...


package.zip$mets.xml
package.zip$q4/report.doc
package.zip$fy08/report.doc

In current DSpace parlance, this might in turn result in...

http://host/bitstream/handle/1234.5/67890/1/mets.xml
http://host/bitstream/handle/1234.5/67890/2/q4/report.doc
http://host/bitstream/handle/1234.5/67890/3/fy08/report.doc

And now, where the original relative references in the mets.xml were "proper" in relation to the files in the zip, they are now "NOT" when looking at the resultant URLs in DSpace. Now, that's what I call an inconvenient PITA. And it comes up here with John's issue, it came up in my DDI/VDC work, it came up again in Carl Jones' work with the RVC/Stellar support, and it was happening again with our attempting to predict the location of GIS files in DSpace Items for the Dome GIS Lab interoperability work. Not good.
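
To make the breakage concrete, here is a minimal Python sketch (an illustration only, not DSpace code). It resolves the relative references from the example mets.xml against the manifest's own URL using the standard library's urljoin. The host and handle are the placeholder values from the example above, and the name-only layout is the hypothetical one being argued for here.

```python
from urllib.parse import urljoin

# Relative references as they would appear inside mets.xml in the zip.
RELATIVE_REFS = ["q4/report.doc", "fy08/report.doc"]

# Current layout: every bitstream carries its own sequence id in the path.
SEQUENCED_METS = "http://host/bitstream/handle/1234.5/67890/1/mets.xml"
SEQUENCED_ACTUAL = {
    "http://host/bitstream/handle/1234.5/67890/2/q4/report.doc",
    "http://host/bitstream/handle/1234.5/67890/3/fy08/report.doc",
}

# Hypothetical layout: the path within the Item is the identifier.
NAMED_METS = "http://host/bitstream/handle/1234.5/67890/mets.xml"


def resolve_all(base_url, refs):
    """Resolve each relative reference against the manifest's own URL."""
    return [urljoin(base_url, ref) for ref in refs]


if __name__ == "__main__":
    # With sequence ids, the resolved URLs miss the real bitstream URLs.
    for url in resolve_all(SEQUENCED_METS, RELATIVE_REFS):
        print("found" if url in SEQUENCED_ACTUAL else "missing", url)
    # Without them, the references resolve to predictable locations.
    for url in resolve_all(NAMED_METS, RELATIVE_REFS):
        print(url)
```

In the sequenced layout both references resolve under ".../1/", the sequence id of mets.xml itself, so neither matches the URL DSpace actually assigned.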

Finally, on the Dissemination naming side, this breaks yet again. If I were instead to have the following item in DSpace:


http://host/bitstream/handle/1234.5/67890/1/mets.xml
http://host/bitstream/handle/1234.5/67890/2/report.doc
http://host/bitstream/handle/1234.5/67890/3/report.doc

I can't now use the file names to represent the files in the METS DIP. How can I have two different Zip Entries with the same file name?


package.zip$mets.xml
package.zip$report.doc
package.zip$report.doc

Just doesn't expand without one of the files getting overwritten.
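
A small Python sketch of that collision, using only the standard zipfile module. This is an illustration under stated assumptions: the entry names and contents are made up, and expansion into a directory is modelled as a dict keyed by entry name rather than by writing real files.

```python
import io
import warnings
import zipfile


def build_package(entries):
    """Write (name, content) pairs into an in-memory zip, duplicates included."""
    buffer = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about a duplicate name but still writes the entry.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buffer, "w") as package:
            for name, content in entries:
                package.writestr(name, content)
    buffer.seek(0)
    return buffer


def expand(buffer):
    """Model expansion into a directory: one slot per file name."""
    expanded = {}
    with zipfile.ZipFile(buffer) as package:
        for info in package.infolist():
            # A later entry with the same name replaces the earlier one.
            expanded[info.filename] = package.read(info)
    return expanded


if __name__ == "__main__":
    package = build_package([
        ("mets.xml", b"<mets/>"),
        ("report.doc", b"q4 content"),
        ("report.doc", b"fy08 content"),
    ])
    for name, content in expand(package).items():
        print(name, content)
```

The archive holds three entries, but only two names survive expansion, and the q4 content is gone.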

No, this is a serious problem in the original design, one that is hurting users/developers who expect conventional behavior and can't get it out of DSpace.

(2) accept the content, but transform or rewrite into unique filenames
(q4-report.doc? report[2].doc?, etc?), which is almost as bad, since we
now have both obscured the original name, and altered what we are
supposed to be preserving.

Wow, something we do agree on: that would be a truly awful solution. But yet, that's just what OSes and Browsers do, isn't it?

or (3) [what DSpace currently does] store the filename as *metadata*,
which, like file size, can be valuable, but which may not be unique,
and use a different identification system that *guarantees* uniqueness
within the item (sequence id).

Which unfortunately (yet again) immediately diverges from the common expectations on files in a filesystem. The allowance of duplicate file names actually introduces the entire problem we are talking about into DSpace because of a deviation from convention. Because the system didn't initially enforce a requirement of unique file names within Items (unlike what is found in your local filesystem and the manifests of zip/tar/rar/etc. archives), now suddenly this allowance in DSpace is misdiagnosed as "the correct way" and the conventional uniqueness of file names as "wrong". This original work was, IMHO, a wrong path taken.

I think because it's a number, the sequence ID is easily confused
with a version, which it is not. And in fact, there is nothing
sacred about sequence numbers as a technique either: we also
considered MD5 checksums, timestamps, (maybe now uuids, etc); sequence
numbers won because the URLs were shorter and easier to use.

All based on the overly complex assumption early on in DSpace history that this was in fact a "big issue", one that DSpace had to have such hacks done in it to solve. If the file path had just been accepted as unique, you wouldn't have this torture at all, and DSpace Items would be containers, just like file directories and archives, and thus adhere to those known standard conventions.

The choice of ID schemes does have consequences, as some of John P.'s
use-cases illustrate: a 'slot number' (which can be reassigned) is
different from a 'sequence number' (which can't), and we can debate
the comparative merits of each (or others): my point was that
filename is an apparent non-starter (for reasons above).

I disagree, file name is the best place to start. And my point is that we just don't need these other schemes at all; the only case where we do need unique ids on files with the same file names is if we introduce revision control. And even then, that allowance of "same name + different revision" is across Item revisions and not across the bitstreams in those items.

As to the 'heuristic' URLs in 1.5 Manakin, I regard them as closer to
a bug than a solution.

That's harsh. I don't consider something that was discussed and thought out by the committers working on the XMLUI, with an explicit end-goal of being a path to a better solution for all this in 2.0, a "bug".

Just as we would never use an online bank that looked up our account files by taking the first match for our last names, so I think we should not accept indeterminate semantics in bitstream retrieval (I wanted 'fy08', but got 'q4') - that's what unique IDs are for.

That's not a valid comparison. I'm not talking about "indeterminate behavior" because there is none. When properly implemented, the behavior is predictable and results in the end in a system that treats file paths as unique resources in a system (the web) historically designed to do so. It is certainly much simpler than "reassigning" and "obfuscating" file identification based on some perceived notion that there is a problem where there really was not one.


-Mark


Mark Diggory wrote:


On Aug 15, 2008, at 12:15 PM, John Preston wrote:

On Fri, Aug 15, 2008 at 1:40 PM, Richard Rodgers <[EMAIL PROTECTED]>
wrote:

On Fri, 2008-08-15 at 10:12 -0700, Mark Diggory wrote:

On Aug 15, 2008, at 9:36 AM, John Preston wrote:

Hi. Can anyone say how I can re-use a bitstream sequence number. The
use case is the following....

On Aug 15, 2008, at 10:01 AM, Mark H. Wood wrote:

Allowed or not, this sounds risky.  If you are overloading the
sequence number with a new meaning, this practice is likely to bite
you again and again, since the developing stock code won't recognize
your second meaning and will take no pains to preserve it....

Mark is correct about overloading the semantics here.  Note, we
adjusted the behavior behind the DSpace 1.5 XMLUI (but not the JSPUI)
to allow for unsequenced name resolution of the bitstreams. For
instance:
...
It certainly would have been much easier to key Bitstreams on the
name rather than a sequence id in the original architecture.  I've
seen requests such as yours numerous times during my history of
working on DSpace, and being able to reference resources by simple,
assignable, predictable names rather than internally generated
sequence ids makes life on the outside of DSpace easier and 3rd-party
tooling more powerful.  This is something I hope to take into the 2.0
development initiative.

Easier perhaps, but unfortunately the Bitstream filename need not be
unique, so is a problematic candidate for a durable reference.

Richard, that is the crux of my criticism. It would be easier and
more useful all around if the name were part of the
identifier/re-visioning strategy for the item in DSpace 2.0, using the
name as the identifier for the bitstream within the scope of that Item
and its item-wide revision id. The current XMLUI support is a
transition somewhere between the original DSpace behavior and this
Item re-visioning end-goal of 2.0.

Likewise, John's case is yet another example of why we need the
ability to assign such identifiers rather than have them assigned
internally.  And because John seeks to supply an updated version of
the file with the requirement that he not have to remove all the
bitstreams and recreate them in order to reconstruct all the local
references to that specific bitstream within his item, it's a
reasonable use case.  I encountered this when creating the DDI
metadata (relative URI) describing the data files I ported from the
Virtual Data Center to DSpace.

http://dspace.mit.edu/handle/1721.1/39118

Where I might have:

http://dspace.mit.edu/bitstream/handle/1721.1/39126/1/study.xml

How would I define my DDI's relative references to the other
bitstreams prior to having ingested the entire package representing
the Item into DSpace, when my external application doesn't have
access to this internally generated sequence id until after the fact?
(that's rhetorical and answered below)

http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/3/womenpolicymakers_census_dta.tab
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/2/womenpolicymakers_census.dta
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/5/womenpolicymakers_parta_dta.tab

Rather than the above, reserving the name to be the unique identifier
and eliminating the bitstream sequence id from the path allows me
this flexibility.

http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/study.xml?sequence=1
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/womenpolicymakers_census_dta.tab?sequence=3
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/womenpolicymakers_census.dta?sequence=2
http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/womenpolicymakers_parta_dta.tab?sequence=5

These can all be referenced relatively, as listed below (without
uniqueness constraints), if the heuristic for resolution is sensible
and predictable. I admit this heuristic is currently poorly defined
and could use adjustment to return the bitstream with the same name
and latest sequence id, thus becoming, in a sense, a "poor man's"
re-visioning system for 1.5.

./study.xml
./womenpolicymakers_census_dta.tab
./womenpolicymakers_census.dta
./womenpolicymakers_parta_dta.tab

And if I wish to retain the granularity of the sequence id as a
revision identifier when referring to the bitstream:

./study.xml?sequence=1
./womenpolicymakers_census_dta.tab?sequence=3
./womenpolicymakers_census.dta?sequence=2
./womenpolicymakers_parta_dta.tab?sequence=5

Because of this "chicken-and-egg" problem that DSpace (pre-1.5 XMLUI)
creates, I had to abandon any attempts to capture changes to the
bitstreams (or even the bitstream's initial sequence id) because of
the lack of granularity in the Import/Package Ingest process.  The
only way that applications can resolve the above relative URIs is to
have a mechanism that tolerates the usage of a composite identifier,
name[?sequence=revision id], as a unique identifier, with a sane
default in the absence of the sequence id, meaning to refer to the
latest.
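
The composite identifier described above could be resolved along these lines. This is a minimal Python sketch, not the XMLUI implementation: the Bitstream class and the resolve function are hypothetical names, and "latest" is assumed to mean the highest sequence id among bitstreams sharing a name.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class Bitstream:
    """Hypothetical stand-in for a bitstream inside one Item."""
    name: str
    sequence_id: int


def resolve(bitstreams, reference):
    """Resolve 'name[?sequence=N]' within one Item.

    Without a sequence parameter, the bitstream with the highest
    sequence id for that name wins. Returns None when nothing matches.
    """
    parts = urlsplit(reference)
    name = parts.path[2:] if parts.path.startswith("./") else parts.path
    candidates = [b for b in bitstreams if b.name == name]
    query = parse_qs(parts.query)
    if "sequence" in query:
        wanted = int(query["sequence"][0])
        candidates = [b for b in candidates if b.sequence_id == wanted]
    return max(candidates, key=lambda b: b.sequence_id, default=None)


if __name__ == "__main__":
    item = [
        Bitstream("study.xml", 1),
        Bitstream("womenpolicymakers_census.dta", 2),
        Bitstream("study.xml", 4),  # a later upload under the same name
    ]
    print(resolve(item, "./study.xml"))
    print(resolve(item, "./study.xml?sequence=1"))
```

The same relative reference can then be written before ingest, since it no longer depends on an id that only exists after the fact.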

I don't think this is an unrealistic behavior to want out of the
system. SVN/VIEWVC handles the subject elegantly by returning the
most recent revision of a file

http://dspace.svn.sourceforge.net/viewvc/dspace/branches/dspace-1_5_x/dspace/docs/html/index.html

and allows the various other revisions of the filename, which is unique
to the current revision, to be returned from more complex queries that
can be maintained against it.

http://dspace.svn.sourceforge.net/viewvc/dspace/branches/dspace-1_5_x/dspace/docs/html/index.html?revision=3044

In fact, this allows a very elegant relative reference solution to
arise that doesn't require recalculation to place relative references
into the system. (It also eliminates the need for a special service
like HTMLServlet to resolve these references using searches for
matching paths in the bitstream names. Simply try navigating the above
documentation in the repository.)

How will the versioning scheme, that I recall being talked about some
time ago, work? Did it not need to keep a stable reference to a
bitstream along with versions?

John

Yes, it does intend to, and currently that scheme is outdated in the
architectural review given a number of new considerations with the
usage of UUIDs and referring to resources without nested hierarchies
of identifiers. There was also a bit of recent work that went on in
the Bristol meeting around relying on underlying support for
versioning in the storage layers of the new 2.0 architecture.
However, that's not completely thought out either.

My current viewpoint on the subject was that the versioning
discussion in the architectural review outlined a need to have
versioning be at the Item level only. This meant that revisions would
be referred to via an item revision id rather than on individual
bitstream sequence ids. For instance

http://host/resource/[Item ID]/[Item_Version_ID]/[Manifestation_ID]/[File_ID]

And for example this might result in something that looks like:

http://host/resource/Item_X/Version_1/Manifestation_Y/study.xml
http://host/resource/Item_X/Version_1/Manifestation_Y/womenpolicymakers_census_dta.tab
http://host/resource/Item_X/Version_1/Manifestation_Y/womenpolicymakers_census.dta
http://host/resource/Item_X/Version_1/Manifestation_Y/womenpolicymakers_parta_dta.ta

http://host/resource/Item_X/Version_2/Manifestation_Y/study.xml
http://host/resource/Item_X/Version_2/Manifestation_Y/womenpolicymakers_census_dta.tab
http://host/resource/Item_X/Version_2/Manifestation_Y/womenpolicymakers_census.dta
http://host/resource/Item_X/Version_2/Manifestation_Y/womenpolicymakers_parta_dta.ta

Here I have just replaced "womenpolicymakers_census_dta.tab", and
the other referenced Bitstreams are simply retained and mapped to the
new version Id.
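
A rough Python sketch of that item-level versioning idea, under assumptions of my own: each Item version is modelled as a mapping from file name to a content id (a hypothetical storage key), version numbers start at 1, and omitting the version number means the latest. Nothing here reflects an actual DSpace 2.0 API.

```python
def add_version(versions, replacements):
    """Append a new Item version that replaces only the named files.

    Every file not mentioned in `replacements` is retained and mapped
    into the new version unchanged.
    """
    new_version = dict(versions[-1]) if versions else {}
    new_version.update(replacements)
    versions.append(new_version)
    return new_version


def lookup(versions, file_name, version_number=None):
    """Look up a file by name; no version number means the latest version."""
    if version_number is None:
        version_number = len(versions)
    if not 1 <= version_number <= len(versions):
        raise ValueError(f"no such item version: {version_number}")
    return versions[version_number - 1][file_name]


if __name__ == "__main__":
    versions = []
    add_version(versions, {
        "study.xml": "content-a",
        "womenpolicymakers_census_dta.tab": "content-b",
    })
    # Replace one file; study.xml is carried over to version 2.
    add_version(versions, {"womenpolicymakers_census_dta.tab": "content-c"})
    print(lookup(versions, "study.xml", 2))
    print(lookup(versions, "womenpolicymakers_census_dta.tab"))
    print(lookup(versions, "womenpolicymakers_census_dta.tab", 1))
```

Because the file name stays fixed across versions, a relative reference inside study.xml keeps working in every version of the Item.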

This furthers my proposed strategy above by still retaining the
relative reference capabilities within the "critical bitstream
portion" of the path.

We also talked about the following defaulting to the latest
version, not unlike the behavior of SVN/VIEWVC.

http://host/resource/Item_X/Manifestation_Y/study.xml
http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_census_dta.tab
http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_census.dta
http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_parta_dta.ta

Note, if you're confused about what a "Manifestation" is: it represents,
in the DSpace 2.0 model, a replacement for the Bundle that is
properly exposed and aligns with the Manifestation conceptualized in
the FRBR area of research.

Cheers,
Mark

~~~~~~~~~~~~~
Mark R. Diggory - DSpace Developer and Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology
Home Page: http://purl.org/net/mdiggory/homepage

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
