Re: [Dspace-tech] Reusing bitstream sequence number

Richard Rodgers Mon, 18 Aug 2008 09:39:07 -0700

Hi Mark:

That's fine - any part of DSpace is fair game for debate. I
just wanted to inform the discussion that the current
design is based on a careful, reasonable analysis, and that
there may be hidden costs in alternatives.


I do worry about opening door #1 [content rejection],
since taking assets as found seems pretty close to the bedrock
use-case for digital repositories - at least preservation-minded ones.

Or to put it more provocatively: DSpace could keep its hands clean (and
its URLs pretty), but only by pushing the problem back on content
providers, who would be left with what you characterize as the "truly
awful" dirty work of ensuring unique filenames.

Food for thought,

Richard
 

On Sat, 2008-08-16 at 17:39 -0700, Mark Diggory wrote:
> Richard,
> 
> 
> I respectfully disagree with you.
> 
> On Aug 16, 2008, at 6:54 AM, Richard Rodgers wrote:
> 
> > Hi Mark:
> > 
> > Let me explain the problem more fully, which is a very simple
> > 'inconvenient truth' about assets: some complex digital objects
> > we want to submit as one Item have filename duplications.
> > E.g. in directory 'q4' we have 'report.doc', but the same filename
> > in directory 'fy08' with different content. In the face of this, we
> > can:
> > 
> > (1) reject the content ("duplicate filenames detected! - please
> > correct or resubmit as multiple items"), which is unacceptable.
> 
> 
> Is it really that unacceptable?!  I disagree: what use are two files
> with the identical name in a DSpace Item? IMHO, it creates
> ambiguity in an area, the file name, where users expect conformity with
> conventions. Really, which file would I choose to download if they had
> the identical name? On top of this, what would I do with the
> second file when the OS/Browser asked me if I wanted to replace the
> first one I just downloaded? I suppose I'd have to rename it to arrive
> back at a state of being able to tell the two apart.
> 
> 
> No, instead we should be adopting RESTful practices here, allowing
> DSpace to adhere to more conventional expectations.
> 
> 
> http://en.wikipedia.org/wiki/Representational_State_Transfer#RESTful_example:_the_World_Wide_Web
> 
> 
> Here, if DSpace "were" to take on RESTful practices in its URI
> conventions, we would be able to do things like versioning and
> predictable resource naming. For instance, in your example:
> 
> 
> PUT /bitstream/handle/1234.5/67890/q4/report.doc HTTP/1.1
> PUT /bitstream/handle/1234.5/67890/fy08/report.doc HTTP/1.1
> 
> 
> Would clearly result in two different bitstreams, whereas if I did do
> 
> 
> PUT /bitstream/handle/1234.5/67890/report.doc HTTP/1.1
> PUT /bitstream/handle/1234.5/67890/report.doc HTTP/1.1
> 
> 
> The second would overwrite the first. That is also legitimate behavior,
> allowing me to replace/version the resource (access to which, if I chose
> to expose it, might look like the following)...
> 
> 
> GET /bitstream/handle/1234.5/67890/report.doc?revision=1 HTTP/1.1
> 
> 
> and
> 
> 
> GET /bitstream/handle/1234.5/67890/report.doc?revision=0 HTTP/1.1
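
A minimal sketch of the PUT/GET behavior described above, assuming an in-memory store keyed on the full path. The class name PathKeyedStore and its methods are hypothetical and are not DSpace APIs; revisions are numbered from 0 to match the ?revision=0 / ?revision=1 URLs in the example.

```python
from typing import Dict, List, Optional


class PathKeyedStore:
    """Hypothetical store: the path is the identifier, each PUT adds a revision."""

    def __init__(self) -> None:
        self._revisions: Dict[str, List[bytes]] = {}

    def put(self, path: str, content: bytes) -> int:
        """Store content under path and return the new 0-based revision number."""
        history = self._revisions.setdefault(path, [])
        history.append(content)
        return len(history) - 1

    def get(self, path: str, revision: Optional[int] = None) -> bytes:
        """Return the requested revision, or the latest when none is given."""
        history = self._revisions[path]
        if revision is None:
            return history[-1]
        if not 0 <= revision < len(history):
            raise KeyError(f"{path} has no revision {revision}")
        return history[revision]


store = PathKeyedStore()
# Different paths give two different bitstreams.
store.put("1234.5/67890/q4/report.doc", b"Q4 figures")
store.put("1234.5/67890/fy08/report.doc", b"FY08 figures")
# Same path twice: the second PUT becomes the latest revision.
store.put("1234.5/67890/report.doc", b"first draft")
store.put("1234.5/67890/report.doc", b"second draft")
```

Under these assumptions, a plain GET of report.doc returns the second draft, while revision=0 still returns the first.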
> 
> 
> Likewise,  we find this relative directory structure convention
> maintained in many other Internet resource related areas... in fact
> this is how the SIP METS and OCW  IMSCP packaging works based on basic
> zip files and manifests.  But, yet again the DSpace solution breaks
> the convention in this case. Take a METS/SIP package representing the
> following...
> 
> 
> package.zip$mets.xml
> package.zip$q4/report.doc
> package.zip$fy08/report.doc
> 
> 
> In current dspace parlance... might in turn result in...
> 
> 
> http://host/bitstream/handle/1234.5/67890/1/mets.xml
> http://host/bitstream/handle/1234.5/67890/2/q4/report.doc
> http://host/bitstream/handle/1234.5/67890/3/fy08/report.doc
> 
> 
> And now where the original relative references in the mets.xml were
> "proper" in relation to the files in the zip, they are now "NOT" when
> looking at the resultant URLs in DSpace.  Now, that's what I call
> an inconvenient PITA. And it comes up here with John's issue, it came
> up in my DDI/VDC work, it came up again in Carl Jones' work with the
> RVC/Stellar support and it was happening again with our attempting to
> predict the location of GIS files in DSpace Items for the Dome GIS
> Lab interoperability work. Not good.
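
The broken relative reference can be reproduced with standard URL resolution. This is only an illustration using Python's urllib.parse.urljoin on the example URLs above; the variable names are made up for the sketch.

```python
from urllib.parse import urljoin

# URLs as they appear in the example above; the sequence number differs per bitstream.
mets_url = "http://host/bitstream/handle/1234.5/67890/1/mets.xml"
actual_url = "http://host/bitstream/handle/1234.5/67890/2/q4/report.doc"

# Resolve the relative reference found inside mets.xml against the URL of mets.xml.
resolved = urljoin(mets_url, "q4/report.doc")

print(resolved)                # http://host/bitstream/handle/1234.5/67890/1/q4/report.doc
print(resolved == actual_url)  # False: the reference lands under /1/, the file lives under /2/
```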
> 
> 
> Finally, on the Dissemination naming side, this breaks yet again. If I
> were instead to have the following item in DSpace:
> 
> 
> http://host/bitstream/handle/1234.5/67890/1/mets.xml
> http://host/bitstream/handle/1234.5/67890/2/report.doc
> http://host/bitstream/handle/1234.5/67890/3/report.doc
> 
> 
> I can't now use the file names to represent the files in the METS DIP.
> How can I have two different Zip Entries with the same file name?
> 
> 
> package.zip$mets.xml
> package.zip$report.doc
> package.zip$report.doc
> 
> 
> Just doesn't expand without one of the files getting overwritten.  
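
A small sketch of the same effect with Python's standard zipfile module, which (with a warning) does let duplicate entry names be written. The entry contents are placeholders, not real package data.

```python
import io
import tempfile
import warnings
import zipfile
from pathlib import Path

buffer = io.BytesIO()
with warnings.catch_warnings():
    # zipfile warns about the duplicate entry name but still writes it.
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mets.xml", "<mets/>")
        package.writestr("report.doc", "Q4 content")
        package.writestr("report.doc", "FY08 content")

with zipfile.ZipFile(buffer) as package:
    entry_names = package.namelist()
    with tempfile.TemporaryDirectory() as target:
        package.extractall(target)
        extracted = sorted(p.name for p in Path(target).iterdir())

print(len(entry_names))  # three entries in the archive
print(extracted)         # only two files on disk after expansion
```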
> 
> 
> No, this is a serious problem in the original design, one that is hurting
> users/developers who expect conventional behavior and can't get it out
> of DSpace.
> 
> 
> > (2) accept the content, but transform or rewrite into unique
> > filenames 
> > (q4-report.doc? report[2].doc?, etc?), which is almost as bad, since
> > we
> > now have both obscured the original name, and altered what we are
> > supposed to be preserving.
> 
> 
> Wow, something we do agree on, that would be a truly awful solution.
> But yet, that's just what OS's and Browsers do, isn't it?
> 
> > or (3) [what DSpace currently does] store the filename as
> > *metadata*,
> > which, like file size, can be valuable, but which may not be unique,
> > and use a different identification system that *guarantees*
> > uniqueness
> > within the item (sequence id).
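
Option (3) can be pictured roughly as follows. This is a simplified sketch, not the actual DSpace data model: the Bitstream and Item classes and their field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Bitstream:
    sequence_id: int   # unique within the item
    name: str          # descriptive metadata; may repeat
    content: bytes


@dataclass
class Item:
    bitstreams: Dict[int, Bitstream] = field(default_factory=dict)
    next_sequence_id: int = 1

    def add(self, name: str, content: bytes) -> Bitstream:
        """Assign the next sequence id; an id is never handed out twice."""
        bitstream = Bitstream(self.next_sequence_id, name, content)
        self.bitstreams[bitstream.sequence_id] = bitstream
        self.next_sequence_id += 1
        return bitstream


item = Item()
q4 = item.add("report.doc", b"Q4 content")
fy08 = item.add("report.doc", b"FY08 content")
```

Both bitstreams keep the original name "report.doc", and lookup by sequence id stays unambiguous.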
> 
> 
> Which unfortunately, (yet again) immediately diverges from the common
> expectations on files in a filesystem.  The allowance
> of duplicate file names actually introduces the entire problem we are
> talking about into DSpace because of a deviation from convention.
> Because the system didn't initially enforce a requirement of unique
> file names within Items (unlike what is found in your local filesystem
> and the manifests of zip/tar/rar/etc archives), now suddenly, this
> allowance in DSpace is misdiagnosed as "the correct way" and the
> conventional uniqueness of file names as "wrong". This original work
> was, IMHO, a wrong path taken.
> 
> 
> > I think because it's a number, the sequence ID is easily confused
> > with a version, which it is not. And in fact, there is nothing
> > sacred about sequence numbers as a technique either: we also
> > considered MD5 checksums, timestamps, (maybe now uuids, etc);
> > sequence numbers won because the URLs were shorter and easier to use.
> 
> 
> All based on the overly complex assumption early on in DSpace history
> that this was in fact a "big issue" that DSpace had to have such hacks
> done in it to solve. If the file path had just been accepted as
> unique, you wouldn't have this torture at all and DSpace Items would
> be containers, just like file directories and archives, and would thus
> adhere to those known standard conventions.
> 
> > The choice of ID schemes does have consequences, as some of John P.'s
> > use-cases illustrate: a 'slot number' (which can be reassigned) is
> > different from a 'sequence number' (which can't), and we can debate
> > the comparative merits of each (or others): my point was that
> > filename is an apparent non-starter (for reasons above).
> 
> 
> I disagree, file name is the best place to start.  And my point is we
> just don't need these other schemes at all; the case where we do need
> unique ids on files with the same filename is if we introduce
> revision control.  And even then, that allowance of "same name +
> different revision" is across Item revisions and not across the
> bitstreams in those items.
> 
> > As to the 'heuristic' URLs in 1.5 Manakin, I regard them as closer to
> > a bug than a solution.
> 
> 
> That's harsh.  I don't consider something that was discussed and
> thought out by the committers working on the XMLUI with
> an explicit end-goal of being a path to a better solution for all this
> in 2.0, a "bug".
> 
> 
> > Just as we would never use an online bank that looked up our account
> > files by taking the first match for our last names, so I think we
> > should not accept indeterminate semantics in bitstream retrieval (I
> > wanted 'fy08', but got 'q4') - that's what unique IDs are for.
> 
> 
> That's not a valid comparison, I'm not talking about
> "indeterminate behavior" because there is none.  When properly
> implemented the behavior is predictable and results in the end with a
> system that treats file paths as unique resources in a system (the
> web) historically designed to do so.  It is certainly much simpler
> than "reassigning" and "obfuscating" file identification based on
> some perceived notion that there is a problem where there really was
> not one.
> 
> 
> -Mark
> 
> 
> > 
> > Mark Diggory wrote: 
> > > On Aug 15, 2008, at 12:15 PM, John Preston wrote:
> > >   
> > > > On Fri, Aug 15, 2008 at 1:40 PM, Richard Rodgers <[EMAIL PROTECTED]>  
> > > > wrote:
> > > >     
> > > > > On Fri, 2008-08-15 at 10:12 -0700, Mark Diggory wrote:
> > > > >       
> > > > > > > On Aug 15, 2008, at 9:36 AM, John Preston wrote:
> > > > > > >           
> > > > > > > > Hi. Can anyone say how I can re-use a bitstream sequence  
> > > > > > > > number. The
> > > > > > > > use case is the following....
> > > > > > > > 
> > > > > > > >             
> > > > > > On Aug 15, 2008, at 10:01 AM, Mark H. Wood wrote:
> > > > > > 
> > > > > >         
> > > > > > > Allowed or not, this sounds risky.  If you are overloading the
> > > > > > > sequence number with a new meaning, this practice is likely to 
> > > > > > > bite
> > > > > > > you again and again, since the developing stock code won't  
> > > > > > > recognize
> > > > > > > your second meaning and will take no pains to preserve it....
> > > > > > >           
> > > > > > Mark is correct about overloading the semantics here.  Note, We
> > > > > > adjusted the behavior behind the dspace 1.5 XMLUI (but not the  
> > > > > > JSPUI)
> > > > > > to allow for unsequenced name resolution of the bitstreams. For
> > > > > > instance:
> > > > > > ...
> > > > > > It certainly would have been much easier to key Bitstreams on the
> > > > > > name rather than a sequence id in the original architecture.  I've
> > > > > > seen requests such as yours numerous times during my history of
> > > > > > working on DSpace and being able to reference resources by simple
> > > > > > assignable predictable names rather than internally generated
> > > > > > sequence ids makes life on the outside of DSpace easier and 3rd  
> > > > > > party
> > > > > > tooling more powerful.  This is something I hope to take into the  
> > > > > > 2.0
> > > > > > development initiative.
> > > > > >         
> > > > > Easier perhaps, but unfortunately the Bitstream filename need not be
> > > > > unique, so is a problematic candidate for a durable reference.
> > > > >       
> > > Richard, that is the crux of my criticism. It would be easier and
> > > more useful all around if the name were part of the
> > > identifier/revisioning strategy for the item in DSpace 2.0, using
> > > the name as the identifier for the bitstream within the scope of
> > > that Item and its item-wide revision id. The current XMLUI support
> > > is a transition somewhere between the original DSpace behavior and
> > > this Item revisioning end-goal of 2.0.
> > > 
> > > Likewise, John's case is yet another example of why we need the
> > > ability to assign such identifiers rather than have them assigned
> > > internally.  And because John seeks to supply an updated version of
> > > the file with the requirement that he not have to remove all the
> > > bitstreams and recreate them in order to reconstruct all the local
> > > references to that specific bitstream within his item, it's a
> > > reasonable use case.  I encountered this when creating the DDI
> > > metadata (relative URI) describing the data files I ported from the  
> > > Virtual Data Center to DSpace.
> > > 
> > > http://dspace.mit.edu/handle/1721.1/39118
> > > 
> > > Where I might have:
> > > 
> > > http://dspace.mit.edu/bitstream/handle/1721.1/39126/1/study.xml
> > > 
> > > How would I define my DDI's relative references to the other  
> > > bitstreams prior to having ingested the entire package representing  
> > > the Item into DSpace, when my external application doesn't have  
> > > access to this internally generated sequence id until after the fact?  
> > > (that's rhetorical and answered below)
> > > 
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/3/ 
> > > womenpolicymakers_census_dta.tab
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/2/ 
> > > womenpolicymakers_census.dta
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/5/ 
> > > womenpolicymakers_parta_dta.tab
> > > 
> > > Rather than the above, reserving the name to be the unique identifier
> > > and eliminating the bitstream sequence id from the path allows me  
> > > this flexibility.
> > > 
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/study.xml? 
> > > sequence=1
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/ 
> > > womenpolicymakers_census_dta.tab?sequence=3
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/ 
> > > womenpolicymakers_census.dta?sequence=2
> > > http://dspace-test.mit.edu/bitstream/handle/1721.1/39126/ 
> > > womenpolicymakers_parta_dta.tab?sequence=5
> > > 
> > > These can all be relatively referenced easily as shown below (without
> > > uniqueness constraints) if the heuristic for resolution is sensible and
> > > predictable. I admit this heuristic is currently poorly defined and
> > > could use adjustment to return the bitstream with the same name and
> > > latest sequence id, thus becoming, in a sense, a "poor man's"
> > > revisioning system for 1.5.
> > > 
> > > ./study.xml
> > > ./womenpolicymakers_census_dta.tab
> > > ./womenpolicymakers_census.dta
> > > ./womenpolicymakers_parta_dta.tab
> > > 
> > > And if I wish to retain the granularity of the sequence id as a
> > > revision identifier when referring to the bitstream:
> > > 
> > > ./study.xml?sequence=1
> > > ./womenpolicymakers_census_dta.tab?sequence=3
> > > ./womenpolicymakers_census.dta?sequence=2
> > > ./womenpolicymakers_parta_dta.tab?sequence=5
> > > 
> > > Because of this "chicken-and-egg" problem that DSpace (pre 1.5 xmlui)
> > > creates, I had to abandon any attempts to capture changes to the
> > > bitstreams (or even the bitstreams' initial sequence ids) because of
> > > the lack of granularity in the Import/Package Ingest process.  The
> > > only way that Applications can resolve the above relative
> > > URIs is to have a mechanism that tolerates the usage of a
> > > composite identifier, name[?sequence=revision id], as a unique
> > > identifier, with a sane default: the absence of the sequence id
> > > means to refer to the latest.
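
The composite identifier could be resolved along these lines. This is a sketch under assumptions, not the XMLUI's actual heuristic: the BITSTREAMS table is hypothetical data (including a made-up re-upload of study.xml as sequence 6), and resolve() is an illustrative helper.

```python
from typing import List, Tuple
from urllib.parse import parse_qs, urlsplit

# (sequence id, name) pairs for one item; hypothetical data with a re-uploaded study.xml.
BITSTREAMS: List[Tuple[int, str]] = [
    (1, "study.xml"),
    (2, "womenpolicymakers_census.dta"),
    (3, "womenpolicymakers_census_dta.tab"),
    (6, "study.xml"),
]


def resolve(reference: str) -> int:
    """Return the sequence id addressed by a reference like './study.xml?sequence=1'."""
    parts = urlsplit(reference)
    name = parts.path.rsplit("/", 1)[-1]
    candidates = [seq for seq, bitstream_name in BITSTREAMS if bitstream_name == name]
    if not candidates:
        raise LookupError(f"no bitstream named {name!r}")
    requested = parse_qs(parts.query).get("sequence")
    if requested is None:
        return max(candidates)  # default: the latest sequence id for that name
    sequence = int(requested[0])
    if sequence not in candidates:
        raise LookupError(f"{name!r} has no sequence {sequence}")
    return sequence
```

With this table, "./study.xml" resolves to sequence 6 and "./study.xml?sequence=1" to sequence 1, so a reference written before ingest stays valid after a file is replaced.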
> > > 
> > > I don't think this is an unrealistic behavior to want out of the  
> > > system. SVN/VIEWVC handles the subject elegantly by returning the  
> > > most recent revision of a file
> > > 
> > > http://dspace.svn.sourceforge.net/viewvc/dspace/branches/dspace-1_5_x/ 
> > > dspace/docs/html/index.html
> > > 
> > > and allowing the various other revisions of the filename, which is unique
> > > to the current revision, to be returned from more complex queries that
> > > can be maintained against it.
> > > 
> > > http://dspace.svn.sourceforge.net/viewvc/dspace/branches/dspace-1_5_x/ 
> > > dspace/docs/html/index.html?revision=3044
> > > 
> > > In fact, this allows a very elegant relative reference solution to
> > > arise that doesn't require recalculation to place relative references
> > > into the system, and eliminates the need for a special service like
> > > HTMLServlet to resolve these references using searches for matching
> > > paths in the bitstream names. (Simply try navigating the above
> > > documentation in the repository.)
> > > 
> > >   
> > > > How will the versioning scheme, that I recall being talked about some
> > > > time ago, work? Did it not need to keep a stable reference to a
> > > > bitstream along with versions?
> > > > 
> > > > John
> > > > 
> > > >     
> > > Yes, it does intend to, and currently that scheme is outdated in the  
> > > architectural review given a number of new considerations with the  
> > > usage of UUIDs and referring to resources without nested hierarchies
> > > of identifiers. There was also a bit of recent work that went on in  
> > > the Bristol meeting around relying on underlying support for  
> > > versioning in the storage layers of the new 2.0 architecture.  
> > > However, that's not completely thought out either.
> > > 
> > > My current viewpoint on the subject was that the versioning  
> > > discussion in the architectural review outlined a need to have  
> > > versioning be at the Item level only. This meant that revisions would  
> > > be referred to via an item revision id rather than on individual  
> > > bitstream sequence ids. For instance
> > > 
> > > http://host/resource/[Item ID]/[Item_Version_ID]/[Manifestation_ID]/ 
> > > [File_ID]
> > > 
> > > And for example this might result in something that looks like:
> > > 
> > > http://host/resource/Item_X/Version_1/Manifestation_Y/study.xml
> > > http://host/resource/Item_X/Version_1/Manifestation_Y/ 
> > > womenpolicymakers_census_dta.tab
> > > http://host/resource/Item_X/Version_1/Manifestation_Y/ 
> > > womenpolicymakers_census.dta
> > > http://host/resource/Item_X/Version_1/Manifestation_Y/ 
> > > womenpolicymakers_parta_dta.ta
> > > 
> > > http://host/resource/Item_X/Version_2/Manifestation_Y/study.xml
> > > http://host/resource/Item_X/Version_2/Manifestation_Y/ 
> > > womenpolicymakers_census_dta.tab
> > > http://host/resource/Item_X/Version_2/Manifestation_Y/ 
> > > womenpolicymakers_census.dta
> > > http://host/resource/Item_X/Version_2/Manifestation_Y/ 
> > > womenpolicymakers_parta_dta.ta
> > > 
> > > where I had just replaced "womenpolicymakers_census_dta.tab" and
> > > the other referenced Bitstreams are just retained and mapped to the
> > > new version Id.
> > > 
> > > This furthers my proposed strategy above by still retaining the  
> > > relative reference capabilities within the "critical bitstream  
> > > portion" of the path.
> > > 
> > > As well we talked about the following defaulting to the Latest  
> > > version, not unlike the behavior of SVN/VIEWVC.
> > > 
> > > http://host/resource/Item_X/Manifestation_Y/study.xml
> > > http://host/resource/Item_X/Manifestation_Y/ 
> > > womenpolicymakers_census_dta.tab
> > > http://host/resource/Item_X/Manifestation_Y/womenpolicymakers_census.dta
> > > http://host/resource/Item_X/Manifestation_Y/ 
> > > womenpolicymakers_parta_dta.ta
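
The item-level versioning idea, with unchanged bitstreams carried over and unversioned URLs defaulting to the latest version, might be modelled like this. It is only a sketch: the Version mapping and the new_version() and fetch() helpers are hypothetical, not part of any DSpace 2.0 design document.

```python
from typing import Dict, List, Optional

# Each item version maps file name -> stored content; names are unique within a version.
Version = Dict[str, bytes]


def new_version(versions: List[Version], replacements: Version) -> Version:
    """Append a version that keeps every existing file except the replaced ones."""
    latest = dict(versions[-1])  # copy the mapping; unchanged files carry over
    latest.update(replacements)
    versions.append(latest)
    return latest


def fetch(versions: List[Version], name: str, version: Optional[int] = None) -> bytes:
    """Fetch a file by 1-based item version; no version means the latest."""
    if version is None:
        return versions[-1][name]
    if not 1 <= version <= len(versions):
        raise LookupError(f"no item version {version}")
    return versions[version - 1][name]


versions: List[Version] = [
    {"study.xml": b"ddi", "womenpolicymakers_census_dta.tab": b"old table"}
]
new_version(versions, {"womenpolicymakers_census_dta.tab": b"new table"})
```

Here study.xml is identical under Version_1 and Version_2, and a request without a version returns the replaced table.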
> > > 
> > > Note, if you're confused about what a "Manifestation" is: it represents,
> > > in the DSpace 2.0 model, a replacement for the Bundle that is
> > > properly exposed and aligns with the Manifestation conceptualized in
> > > the FRBR area of research.
> > > 
> > > Cheers,
> > > Mark
> > > 
> > > ~~~~~~~~~~~~~
> > > Mark R. Diggory - DSpace Developer and Systems Manager
> > > MIT Libraries, Systems and Technology Services
> > > Massachusetts Institute of Technology
> > > Home Page: http://purl.org/net/mdiggory/homepage
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -------------------------------------------------------------------------
> > > This SF.Net email is sponsored by the Moblin Your Move Developer's 
> > > challenge
> > > Build the coolest Linux based applications with Moblin SDK & win great 
> > > prizes
> > > Grand prize is a trip for two to an Open Source event anywhere in the 
> > > world
> > > http://moblin-contest.org/redirect.php?banner_id=100&url=/
> > > _______________________________________________
> > > DSpace-tech mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/dspace-tech
> > >   
> > 
> 
> 


