I certainly don't disagree with much of what you're saying. I think we are looking at the same subject from different perspectives. My statements may be assumptive, but we need to start the discussion somewhere if we are to come to a common understanding...
On Mar 5, 2008, at 5:36 PM, Stefano Mazzocchi wrote:

> I personally think that it's very easy to underestimate the amount of
> energy that goes into "mandating coherence".

I simply want to see DSpace provide an interface that promotes reuse of existing values/equivalencies and provides a feedback loop that informs the producers and maintainers of the content about common or popular terminology. I'm not suggesting a "top-down mandate", but a "bottom-up" enabling of the user experience, such that we can "entice" users to take popular suggested values as their own personal choice.

> Any dataset is an exercise of data integration, even in the simple case
> of a single dataset with data entry performed by more than one
> individual (not necessarily at the same time!).
>
> "Hal Abelson" and "H. Abelson" and "Harold Abelson" and "Abelson,
> Harold" are all correct forms and unless all the people doing data entry
> (or data entry oversight, or batch upload, or data integration or data
> conversion + upload) knows precisely which one is the mandated canonical
> (and never makes mistakes!), you're doomed to have multiple forms
> anyway.

Still, I think the user interfaces for editing metadata in DSpace can exert some control over this experience, especially if metadata is organized more appropriately under the hood and we have more feedback loops into the system.

1.) At least having the metadata you're attempting to "control" broken up into statements with URIs (as you suggested earlier) opens the door to having equivalencies managed somewhere and stored in Longwell (I even wonder if these initial URIs could be automatically generated, for instance, based on checksums of the existing values, establishing that in the Item statements generated by DSpace something like...
    <sha:67eaf8ea6b219545fe7a6881f209f28c> rdf:label "Hal Abelson"

would be true every time it showed up as a metadata value in a DSpace instance, and would at least give us a starting point for making equivalencies).

2.) Exposing some sort of web services (used very loosely) on Longwell would provide a feedback mechanism that could be used to populate suggestion dropdowns or controlled vocabularies in DSpace (or entirely different applications). Even more so, it could inform a UI devoted to producing mappings or accepting mapped values as a "correction" to the existing value. This would provide a user experience in which users are more apt to select a system-provided value over their own text.

Ultimately, Longwell (or at least Sesame + Banach) would become the center of a feedback mechanism that would allow users (curators, submitters and/or regular users) who may be making their own statements (equivalencies) about the data to feed them back to the source. And if this can be used to adjust the data at the source, then the data source can become cleaner iteratively over time thanks to the distributed efforts of its users.

> The natural tendency is to have a 'validation' phase that makes sure
> that only 'clean and coherent' data enters the system. Of course, it is
> naive to think that this would actually work at any reasonable scale
> without an exponentially growing maintenance cost.

Though, no matter whether it is managed in DSpace or in Longwell, we (library operations/repository maintainers) will always be incurring this cost. We need to at least attempt to manage the source (given that we control the application in the first place). Whether we place the mappings downstream in Longwell or upstream in DSpace, we still have to do the metadata management and are still looking for the tools to assist us in doing so.
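To make points 1.) and 2.) a bit more concrete, here is a rough sketch in Python of what I have in mind; it is not anything that exists in DSpace or Longwell. It mints a deterministic identifier from the checksum of a metadata value, records user-declared equivalencies between values, and suggests the most frequently used equivalent label. The names label_uri and EquivalenceStore are hypothetical, and the use of SHA-1 with a "sha:" prefix is an assumption, since which checksum to use is still an open question.

```python
import hashlib
from collections import Counter


def label_uri(value: str) -> str:
    """Mint a deterministic identifier from a metadata value.

    Assumption: SHA-1 over the UTF-8 bytes, with a "sha:" prefix.
    """
    return "sha:" + hashlib.sha1(value.encode("utf-8")).hexdigest()


class EquivalenceStore:
    """Hypothetical in-memory stand-in for an equivalency service."""

    def __init__(self) -> None:
        self._parent = {}        # uri -> representative uri (union-find)
        self._labels = {}        # uri -> original text value
        self._usage = Counter()  # uri -> times seen in item metadata

    def _register(self, value: str) -> str:
        uri = label_uri(value)
        self._parent.setdefault(uri, uri)
        self._labels[uri] = value
        return uri

    def _find(self, uri: str) -> str:
        # Follow parent links to the representative, halving paths as we go.
        while self._parent[uri] != uri:
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def record(self, value: str) -> str:
        """Note one occurrence of a value in an item's metadata."""
        uri = self._register(value)
        self._usage[uri] += 1
        return uri

    def declare_same(self, a: str, b: str) -> None:
        """Record a statement that two values label the same entity."""
        root_a = self._find(self._register(a))
        root_b = self._find(self._register(b))
        if root_a != root_b:
            self._parent[root_b] = root_a

    def suggest(self, value: str) -> str:
        """Return the most used label among those equivalent to value."""
        uri = label_uri(value)
        if uri not in self._parent:
            return value  # nothing known yet; keep the user's own text
        root = self._find(uri)
        members = [u for u in self._parent if self._find(u) == root]
        best = max(members, key=lambda u: (self._usage[u], self._labels[u]))
        return self._labels[best]


if __name__ == "__main__":
    store = EquivalenceStore()
    for v in ["Hal Abelson", "Hal Abelson", "H. Abelson", "Abelson, Harold"]:
        store.record(v)
    store.declare_same("Hal Abelson", "H. Abelson")
    store.declare_same("H. Abelson", "Abelson, Harold")
    print(store.suggest("Abelson, Harold"))
```

A submission form could call something like suggest() on what the user typed and offer the result as a dropdown choice, while still letting the user keep their own text. In a real deployment the equivalencies would presumably live in the triple store rather than in memory.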
And we (MIT Libraries) have been expanding our team to do this under mandates from library directorship. Our current mandate (as I interpret my role in it) is to provide tools that will make this more manageable as we expand into curating Digital Collections in DSpace and seek to bring in more content from faculty and departments.

> (yes, libraries and museum all do that... last time I checked, they were
> all complaining about the cost and inadequacy of their metadata
> management)

Yes, I think we agree we are talking about the inadequacy of our tools and how to improve them. My challenge is that placing Longwell in front of (or beside) DSpace does not "solve" that problem. Without a way for the layperson to edit the relationships, it not only forces the management of such equivalencies back into the hands of the developer; now no one but the developer can maintain them. Something has to be created that will produce and maintain those statements of equivalency. At least altering the metadata in DSpace can be accomplished from a web UI, albeit tediously, one item at a time. It's this intersection of the two platforms that needs exposure to the user. In the DSpace community we need to better enable users to manage their (and others') metadata in DSpace. I think this is a smaller problem to solve than attempting to solve it generally and independently for a larger Semantic Web community.

> The use of equivalences is, IMO, not a UI hack but an entirely different
> paradigm that 'embraces' diversity and entropic variability as a fact of
> life (think thermodynamics) and doesn't treat it as a problem.
>
> By saying that "Hal Abelson" and "H. Abelson" are actually different
> "labels" of the same "entity", and recording that information alongside
> your data, you are not only correcting today's incoherence but
> preventing future one from repeating as well.

I didn't mean to "belittle" the concept by suggesting it in the presentation layer.
I very much respect the concept and the capability it provides, and I want to do more with it. I want to see it inform the user who is introducing that future incoherence. And if they really do mean to introduce it, I want a system that allows them to be the one who defines it and/or clarifies it with any equivalency it may have to an existing entity.

> It's a 'pave the cowpath' approach: see what incoherences exist and deal
> with them, in a way that is reproducible and with information that can
> be shared and repurposed.
>
> Sure it seems easier to write and run a perl script that hits a
> controlled vocabulary, or the MIT directory or a thesaurus or a
> gazetteer... but *that*, IMO, it's the hack, the technological band-aid
> to a deeper, intrinsic, dynamic of data maintenance.
>
> Of course, both approaches can cohexist but be careful about thinking
> that equivalences are useful only when you lack control.... I've found
> out that exercising control is always much harder than it looks, even
> when, on paper, you have plenty.

Thanks, I can see how that's true and can use it to make a more informed analysis. As I said at the beginning, we have to start hammering out our assumptions somewhere. This allows me to refine my statements and clarify that I want the "mapping of equivalencies" to be usable as a tool to manage my source of data, and that sharing them with the world would be done in the hope that others would do the same and that, in such a forum, those relations could be repurposed upstream in my application as well. Starting in DSpace means that we can at least begin to enable the application to support and participate in a semantic web, even if it is initially only shared with other DSpace instances.

Cheers,
Mark

~~~~~~~~~~~~~
Mark R. Diggory - DSpace Developer and Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology
_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
