Mark Diggory wrote:
> Thanks Ian,
> 
> On Mar 4, 2008, at 4:35 PM, Ian Boston wrote:
> 
>> Mark,
>> I am not a Longwell developer, and my comment is a diversion from
>> the core of your question; however, using collected metadata for the
>> disambiguation of author names has significant merit, provided that
>> the corpus of information relating to the author in question contains
>> a small enough range of name variants relative to the size of the
>> collected information.
>> By collected information I mean publications and associated metadata
>> that increase the level of confidence that two names are in fact the
>> same author. I was at a CNI workshop in Washington at the end of last
>> month where this was discussed, and IMVHO (:)), provided that there
>> is some control over the generation of potential names, a
>> sufficiently populated triple store will contain the information and
>> the capability to generate links or additional relationships between
>> names with a level of confidence that they are the same. Obviously
>> Longwell's term vectors (I did get that right, it has term vectors,
>> doesn't it?) may be able to cluster an author's style and language to
>> add to that confidence.
>>
>> Another factor that might help you in your search is other identity
>> references that might not originate from within your own store of
>> name pointers, but are represented as external URIs with a level of
>> trust/confidence, e.g. OpenSocial FOAF stores, OpenID, etc.
>>
>> As I said, not a Longwell developer, but an interested observer.
>> Ian
> 
> 
> I think the community is banking on what you're suggesting being the
> case... at least if this diagram is accurate concerning Linked
> Data... ;-)
> 
> http://en.wikipedia.org/wiki/Image:Linking-Open-Data-diagram_2007-09.png
> 
> But I still suspect that our current case is "misapplied". We loosely
> manage a set of content in one DSpace instance; if we could just get
> it under better management and correct the variation at its source,
> we might be better off than doing it via inference. We are investing
> a lot to establish disambiguations and equivalences on data whose
> publication we already control. Rather than have the result of that
> effort isolated to some "presentation layer", I want it to "stick";
> I'd rather see the content corrected at its source. I feel that
> inference just isn't a replacement for better management of the
> metadata within our own system.
> 
> Using these sorts of equivalences seems much more applicable to cases
> where you do not actually manage the original data, not as a
> replacement for actively maintaining and cleaning up one's data
> locally in a controlled service such as DSpace. It would be great if
> such identity references/services could ultimately be used as
> authorities to assist curators in cleaning up their metadata (or in
> reducing the variance coming from users in the first place). If
> proper feedback loops could be created in a tool such as DSpace to
> place authority controls (or at least suggested values) on such
> metadata during submission/update, I think they would have a lot of
> "traction" in the community.

I personally think that it's very easy to underestimate the amount of 
energy that goes into "mandating coherence".

Any dataset is an exercise in data integration, even in the simple case
of a single dataset with data entry performed by more than one
individual (not necessarily at the same time!).

"Hal Abelson" and "H. Abelson" and "Harold Abelson" and "Abelson, 
Harold" are all correct forms and unless all the people doing data entry 
  (or data entry oversight, or batch upload, or data integration or data 
conversion + upload) knows precisely which one is the mandated canonical 
(and never makes mistakes!), you're doomed to have multiple forms anyway.
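
To make the combinatorics concrete, here is a minimal Python sketch
(the name_variants helper is made up for illustration; it is not part
of DSpace or Longwell) of how many equally "correct" surface forms a
single name yields:

    def name_variants(first, last, nicknames=()):
        """Generate common bibliographic renderings of one person's name."""
        forms = {
            "%s %s" % (first, last),            # "Harold Abelson"
            "%s. %s" % (first[0], last),        # "H. Abelson"
            "%s, %s" % (last, first),           # "Abelson, Harold"
            "%s, %s." % (last, first[0]),       # "Abelson, H."
        }
        for nick in nicknames:                  # nicknames are valid forms too
            forms.add("%s %s" % (nick, last))   # "Hal Abelson"
            forms.add("%s, %s" % (last, nick))  # "Abelson, Hal"
        return forms

    # Six equally valid forms for one person, and at scale every one of
    # them will eventually show up in the data:
    print(sorted(name_variants("Harold", "Abelson", nicknames=("Hal",))))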

The natural tendency is to have a 'validation' phase that makes sure 
that only 'clean and coherent' data enters the system. Of course, it is 
naive to think that this would actually work at any reasonable scale 
without an exponentially growing maintenance cost.

(yes, libraries and museums all do that... last time I checked, they
were all complaining about the cost and inadequacy of their metadata
management)

The use of equivalences is, IMO, not a UI hack but an entirely different 
paradigm that 'embraces' diversity and entropic variability as a fact of 
life (think thermodynamics) and doesn't treat it as a problem.

By saying that "Hal Abelson" and "H. Abelson" are actually different
"labels" of the same "entity", and recording that information alongside
your data, you are not only correcting today's incoherence but also
preventing future ones from reappearing.
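
As a rough sketch of what "recording that information alongside your
data" could look like (an assumed data model, not Longwell's actual
API), a small union-find over labels is enough to make the equivalence
stick:

    class EquivalenceStore:
        """Records that several labels denote one entity (union-find)."""
        def __init__(self):
            self._parent = {}                 # label -> representative label

        def _find(self, label):
            self._parent.setdefault(label, label)
            while self._parent[label] != label:
                # path halving keeps lookups near-constant time
                self._parent[label] = self._parent[self._parent[label]]
                label = self._parent[label]
            return label

        def declare_same(self, a, b):
            """Record that two labels denote the same entity."""
            self._parent[self._find(a)] = self._find(b)

        def entity_of(self, label):
            """Resolve any known label to its current representative."""
            return self._find(label)

    store = EquivalenceStore()
    store.declare_same("Hal Abelson", "H. Abelson")
    store.declare_same("H. Abelson", "Abelson, Harold")

    # A record entered tomorrow under any known variant resolves to the
    # same entity: today's correction also prevents future incoherence.
    assert store.entity_of("Hal Abelson") == store.entity_of("Abelson, Harold")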

It's a 'pave the cowpath' approach: see what incoherences exist and deal 
with them, in a way that is reproducible and with information that can 
be shared and repurposed.
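
And because the equivalences are themselves data, they can be
serialized and exchanged; a sketch using rdflib, with hypothetical
example.org URIs, of publishing the same fact as an owl:sameAs triple:

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    g = Graph()
    # hypothetical URIs -- use whatever identifiers your own store mints
    hal = URIRef("http://example.org/person/harold-abelson")
    alt = URIRef("http://example.org/person/h-abelson")
    g.add((alt, OWL.sameAs, hal))

    # Turtle that any other triple store can load and reason over
    print(g.serialize(format="turtle"))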

Sure, it seems easier to write and run a Perl script that hits a
controlled vocabulary, or the MIT directory, or a thesaurus, or a
gazetteer... but *that*, IMO, is the hack, the technological band-aid
over a deeper, intrinsic dynamic of data maintenance.

Of course, both approaches can coexist, but be careful about thinking
that equivalences are useful only when you lack control... I've found
that exercising control is always much harder than it looks, even when,
on paper, you have plenty.

-- 
Stefano Mazzocchi
Digital Libraries Research Group                 Research Scientist
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave               skype: stefanomazzocchi
Cambridge, MA  02139-4307, USA         email: stefanom at mit . edu
-------------------------------------------------------------------
