Mark Diggory wrote:
> Thanks Ian,
>
> On Mar 4, 2008, at 4:35 PM, Ian Boston wrote:
>
>> Mark,
>> I am not a Longwell developer and my comment is a diversion from the
>> core of your question; however, using collected metadata for the
>> disambiguation of author names has a significant amount of merit,
>> provided that the corpus of information relating to the author in
>> question contains a small enough range of name variants relative to
>> the size of the collected information.
>> By collected information I mean publications and associated metadata
>> that increase the level of confidence that two names are in fact the
>> same author. I was at a CNI workshop in Washington at the end of
>> last month where this was discussed, and IMVHO (:)) provided that
>> there is some control over the generation of potential names, a
>> sufficiently populated triple store will contain the information and
>> the capability to generate links or additional relationships between
>> names with a level of confidence that they are the same. Obviously
>> Longwell's term vectors (I did get that right, it has term vectors,
>> doesn't it?) may be able to cluster an author's style and language
>> to add to the confidence.
>>
>> Another factor that might help you in your search is other identity
>> references that might not originate from within your own store of
>> name pointers, but be represented as external URIs with a level of
>> trust/confidence, e.g. OpenSocial FOAF stores, OpenID, etc.
>>
>> As I said, not a Longwell developer, but an interested observer.
>> Ian
>
> I think the community is banking on what you're suggesting being the
> case... at least if this diagram is accurate concerning Linked
> Data... ;-)
>
> http://en.wikipedia.org/wiki/Image:Linking-Open-Data-diagram_2007-09.png
>
> But I still wonder whether our current case is "misapplied". We
> loosely manage a set of content in one DSpace instance; if we could
> just get it under better management and correct the variation at its
> source, we might be better off than doing it through inference. We
> are investing a lot to establish disambiguations and equivalences on
> data whose publication we already control. Rather than have that
> effort isolated to some "presentation layer", I want its result to
> "stick"; I'd rather see the content corrected at its source. I feel
> that inference just isn't a replacement for better management of the
> metadata within our own system.
>
> Using these sorts of equivalences seems much more applicable to cases
> where you do not actually manage the original data, not as a
> replacement for actively maintaining and cleaning up one's data
> locally in a controlled service such as DSpace. It would be great if
> such identity references/services could ultimately be used as
> authorities to assist curators in cleaning up their metadata (or in
> reducing the variance coming from users in the first place). If
> proper feedback loops could be created in a tool such as DSpace to
> place authority controls (or at least suggested values) on such
> metadata during submission/update, I think they would have a lot of
> "traction" in the community.
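To make Ian's triple-store point concrete, here is a toy sketch in
Python. It is not Longwell or DSpace code, and every record field,
weight, and name in it is invented for illustration; it only shows how
corroborating metadata (shared coauthors, title words, affiliation)
could raise the confidence that two name variants denote the same
author:

    # Toy sketch: score the hypothesis that two author records denote
    # the same person, using corroborating metadata. Not Longwell or
    # DSpace code; fields and weights are invented for illustration.

    def jaccard(a, b):
        """Overlap of two sets, from 0.0 (disjoint) to 1.0 (equal)."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def names_compatible(n1, n2):
        """Crude check that 'H. Abelson' could abbreviate 'Hal Abelson'."""
        t1 = n1.replace(".", "").lower().split()
        t2 = n2.replace(".", "").lower().split()
        # Naive: same final token (surname) and same first initial.
        return t1[-1] == t2[-1] and t1[0][0] == t2[0][0]

    def same_author_confidence(r1, r2):
        """Combine weak signals into one confidence score in [0, 1]."""
        if not names_compatible(r1["name"], r2["name"]):
            return 0.0
        score = 0.3                                    # compatible names
        score += 0.4 * jaccard(r1["coauthors"], r2["coauthors"])
        score += 0.2 * jaccard(r1["title_words"], r2["title_words"])
        if r1["affiliation"] == r2["affiliation"]:
            score += 0.1                               # same institution
        return min(score, 1.0)

    rec1 = {"name": "Hal Abelson",
            "coauthors": {"G. Sussman", "J. Sussman"},
            "title_words": {"structure", "interpretation", "programs"},
            "affiliation": "MIT"}
    rec2 = {"name": "H. Abelson",
            "coauthors": {"G. Sussman"},
            "title_words": {"structure", "programs", "lisp"},
            "affiliation": "MIT"}

    print(same_author_confidence(rec1, rec2))  # 0.7 -> likely the same

In a populated triple store, a score like this would back a weighted
equivalence link between the two name resources rather than a hard
merge of the records.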
I personally think that it's very easy to underestimate the amount of
energy that goes into "mandating coherence". Any dataset is an exercise
in data integration, even in the simple case of a single dataset with
data entry performed by more than one individual (not necessarily at
the same time!). "Hal Abelson" and "H. Abelson" and "Harold Abelson"
and "Abelson, Harold" are all correct forms, and unless all the people
doing data entry (or data entry oversight, or batch upload, or data
integration, or data conversion + upload) know precisely which one is
the mandated canonical form (and never make mistakes!), you're doomed
to have multiple forms anyway.

The natural tendency is to have a 'validation' phase that makes sure
only 'clean and coherent' data enters the system. Of course, it is
naive to think that this would actually work at any reasonable scale
without an exponentially growing maintenance cost. (Yes, libraries and
museums all do that... last time I checked, they were all complaining
about the cost and inadequacy of their metadata management.)

The use of equivalences is, IMO, not a UI hack but an entirely
different paradigm, one that 'embraces' diversity and entropic
variability as a fact of life (think thermodynamics) and doesn't treat
it as a problem. By saying that "Hal Abelson" and "H. Abelson" are
actually different "labels" of the same "entity", and recording that
information alongside your data, you are not only correcting today's
incoherence but preventing future ones from recurring as well. It's a
'pave the cowpath' approach: see what incoherences exist and deal with
them, in a way that is reproducible and with information that can be
shared and repurposed.

Sure, it seems easier to write and run a Perl script that hits a
controlled vocabulary, or the MIT directory, or a thesaurus, or a
gazetteer... but *that*, IMO, is the hack, the technological band-aid
over a deeper, intrinsic dynamic of data maintenance.

Of course, both approaches can coexist, but be careful about thinking
that equivalences are useful only when you lack control... I've found
that exercising control is always much harder than it looks, even
when, on paper, you have plenty.

--
Stefano Mazzocchi               Digital Libraries Research Group
Research Scientist              Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave   skype: stefanomazzocchi
Cambridge, MA 02139-4307, USA   email: stefanom at mit . edu
-------------------------------------------------------------------
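A minimal sketch of the "record the equivalence alongside the data"
paradigm Stefano describes, again in Python; the entity IDs, labels,
and API are invented, and a triple store would more naturally record
the same information as owl:sameAs or skos:altLabel statements:

    # Toy sketch of recording equivalences alongside the data instead
    # of rewriting records at their source. The entity IDs, labels,
    # and API below are invented for illustration.

    class EquivalenceStore:
        """Maps observed labels to canonical entity identifiers."""

        def __init__(self):
            self.label_to_entity = {}

        def assert_same(self, entity_id, *labels):
            """Record that all of these labels denote one entity."""
            for label in labels:
                self.label_to_entity[label] = entity_id

        def resolve(self, label):
            """Return the entity behind a label, or the label itself
            if no equivalence has been recorded (yet)."""
            return self.label_to_entity.get(label, label)

    store = EquivalenceStore()
    store.assert_same("person:abelson-hal",
                      "Hal Abelson", "H. Abelson", "Abelson, Harold")

    # Source records stay untouched; queries resolve through the
    # store, so a future "Harold Abelson" variant costs one new
    # assertion, not a sweep over the whole corpus.
    records = ["Hal Abelson", "H. Abelson", "J. Random Author"]
    print({store.resolve(r) for r in records})
    # -> {'person:abelson-hal', 'J. Random Author'}

The point of the design is that the source records are never
rewritten: each newly observed variant costs one new assertion, and
the assertion itself is data that can be shared and repurposed.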
