Quoting Kevin M Randall <k...@northwestern.edu> (in part):

If the data structure does not allow for determining unambiguous relationships between the pieces of data, that places limits on *any* kind of search engine. As wonderful as Lucene may be, it cannot possibly determine the relationships between pieces of data in a document if that document's structure does not label those relationships. A computer cannot work with something that simply isn't there.

It is, however, possible for data encoded in somewhat different systems to be cross-correlated. For an example, see VIAF <http://viaf.org/> and try the name of a favorite author. The initial work of correlating the data from the LC/NAF and the German authority files, along with the associated bibliographic records, was so effective that it revealed thousands of errors in the LC/NAF -- duplicates, false attributions, and errors in undifferentiated name records. There are limits, of course.
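
To make the idea concrete, here is a rough sketch of how records from two differently encoded sources might be correlated on a normalized key. It is only an illustration and not how VIAF actually does its matching. The record layout (source, id, name, birth), the function names, and the sample data are all made up for the example.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Fold case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(cleaned.casefold().split())


def correlate(records):
    """Group records from several sources by (normalized name, birth year)."""
    clusters = defaultdict(list)
    for rec in records:
        key = (normalize_name(rec["name"]), rec.get("birth"))
        clusters[key].append(rec)
    return clusters


def suspected_duplicates(clusters):
    """Return clusters that hold more than one record from the same source."""
    flagged = {}
    for key, recs in clusters.items():
        by_source = defaultdict(list)
        for rec in recs:
            by_source[rec["source"]].append(rec["id"])
        dupes = {src: ids for src, ids in by_source.items() if len(ids) > 1}
        if dupes:
            flagged[key] = dupes
    return flagged


# Hypothetical sample records, not real authority data.
records = [
    {"source": "A", "id": "a1", "name": "Müller, Hans", "birth": 1901},
    {"source": "A", "id": "a2", "name": "Muller, Hans.", "birth": 1901},
    {"source": "B", "id": "b1", "name": "MÜLLER, Hans", "birth": 1901},
    {"source": "B", "id": "b2", "name": "Müller, Hans", "birth": 1955},
]

clusters = correlate(records)
flagged = suspected_duplicates(clusters)
```

In this toy data the two source "A" records fall into the same cluster as one source "B" record, which flags them as a possible duplicate pair within "A". A key this crude would also produce false matches on real data, which is where the limits come in.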

It's not always necessary to bring existing data exactly into line. For the future, of course, a standard format consistently applied is clearly the way to go; and reprocessing existing data to achieve a closer match to the new standard may be worthwhile -- but at whose cost?

Hal Cain
Melbourne, Australia
hec...@dml.vic.edu.au

