Re: [CODE4LIB] Metadata war stories...

2012-01-29 Thread Stephen Meyer
…intended for a single
institution, or worse, a specific OPAC.

Due to the ambiguity in the spec and the desire to just make it look the
way I want it to look in my OPAC, the temptation is simply too great. In
the end, we have data that couldn't possibly meet the standard as it is
described, which means we spend more time than we expected parsing it in
the next system.

In our case we work through these issues with an army of code tests. Our
catalogers and reference staff find broken examples of MARC holdings data
parsing in our newest discovery system; we gather the real-world MARC
records as a test data set and then write a bunch of RSpec tests so we
don't undo previous bug fixes as we deal with the current ones. The
challenge is coming up with a fast and responsive mechanism/process for
adding a record to the test set once it is identified.
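
A minimal sketch of one such regression spec, assuming the marc and rspec
gems; the HoldingsParser class and fixture path are illustrative
assumptions, not the actual code:

```ruby
require 'marc'
require 'rspec'

RSpec.describe 'MARC holdings parsing regressions' do
  # Every fixture is a real-world record that broke the parser at some
  # point; it stays in the set so the same bug can never come back.
  Dir.glob('spec/fixtures/holdings/*.mrc').each do |path|
    it "parses #{File.basename(path)} without raising" do
      record = MARC::Reader.new(path).first
      expect { HoldingsParser.parse(record) }.not_to raise_error
    end
  end
end
```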

-Steve

Bess Sadler wrote, On 1/27/12 8:26 PM:

  I remember the "required field" operation of... aught six? aught seven?

It all runs together at my age. Turns out, for years people had been making
shell catalog records for items in the collection that needed to be checked
out but hadn't yet been barcoded. Some percentage of these people opted not
to record any information about the item other than the barcode it left the
building under, presumably because they were in a hurry. If there was
such a thing as a metadata crime, that'd be it.

We were young and naive; we thought, why not just index all our catalog
records into Solr? Little did we know what unholy abominations we would
uncover. Out of nowhere, we were surrounded by zombie MARC records,
horrible half-created things, never meant to roam the earth or even to
exist in a sane mind. They could tell us nothing about who they were, what
book they had once tried to describe; they could only stare blankly and
repeat in mangled agony "required field! required field! required
field!" over and over…

It took us weeks to put them all out of their misery.

This is the first time I've ever spoken of this publicly. The support
group is helping with the nightmares, but sometimes still, I wake in a cold
sweat, wondering… did we really find them all?
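
A minimal sketch of the indexing guard that story argues for, assuming
the marc and rsolr gems; the core URL, file name, and required fields
are illustrative assumptions:

```ruby
require 'marc'
require 'rsolr'

# Fields a record must have before it is allowed anywhere near Solr.
REQUIRED = %w[001 245].freeze  # control number and title, for example

solr = RSolr.connect(url: 'http://localhost:8983/solr/catalog')

MARC::Reader.new('full_dump.mrc').each do |record|
  missing = REQUIRED.reject { |tag| record[tag] }
  if missing.any?
    # A zombie: log it for cleanup instead of letting it shamble into the index.
    warn "record #{record['001']&.value || '???'} missing #{missing.join(', ')}"
    next
  end
  solr.add(id: record['001'].value, title_t: record['245']['a'].to_s)
end
solr.commit
```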


Re: [CODE4LIB] Metadata war stories...

2012-01-28 Thread Stephen Meyer

Re: [CODE4LIB] Metadata war stories...

2012-01-28 Thread David Fiander

Re: [CODE4LIB] Metadata war stories...

2012-01-28 Thread Bill Dueber

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Metadata war stories...

2012-01-27 Thread Roy Tennant
Oh, I should have also mentioned that some of the worst problems occur
when people treat their metadata like it will never leave their
institution. When that happens you get all kinds of crazy cruft in a
record. For example, just off the top of my head:

* Embedded HTML markup (one of my favorites is an <img> tag)
* URLs to remote resources that are hard-coded to go through a
particular institution's proxy
* Notes that only have meaning for that institution
* Text that is meant to display to the end-user but may only do so in
certain systems; e.g., "Click here" in a particular subfield.

Sigh...
Roy
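
A rough sketch of sniffing a record dump for the first two patterns on
that list; the proxy host, file name, and regexes are illustrative
assumptions, not OCLC's actual method:

```ruby
require 'marc'

PROXY  = %r{^https?://proxy\.example\.edu/login\?url=}  # made-up proxy host
MARKUP = %r{<[a-z][^>]*>}i                              # e.g. an embedded <img> tag

MARC::Reader.new('records.mrc').each do |record|
  id = record['001'] ? record['001'].value : '???'

  # Hard-coded proxy prefixes in 856 $u
  record.fields('856').each do |field|
    url = field['u']
    puts "#{id}: proxied URL #{url}" if url =~ PROXY
  end

  # Embedded HTML markup in any data field
  record.fields.each do |field|
    next if field.is_a?(MARC::ControlField)
    field.subfields.each do |sf|
      puts "#{id}: markup in #{field.tag} $#{sf.code}: #{sf.value}" if sf.value =~ MARKUP
    end
  end
end
```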

On Fri, Jan 27, 2012 at 4:17 PM, Roy Tennant roytenn...@gmail.com wrote:
 Thanks a lot for the kind shout-out Leslie. I have been pondering what
 I might propose to discuss at this event, since there is certainly
 plenty of fodder. Recently we (OCLC Research) did an investigation of
 856 fields in WorldCat (some 40 million of them) and that might prove
 interesting. By the time ALA rolls around there may be something else
 entirely I could talk about.

 That's one of the wonderful things about having 250 million MARC
 records sitting out on a 32-node cluster. There are any number of
 potentially interesting investigations one could do.
 Roy



Re: [CODE4LIB] Metadata war stories...

2012-01-27 Thread Ethan Gruber
EDIT ME

http://ead.lib.virginia.edu/vivaxtf/view?docId=uva-sc/viu00888.xml;query=;brand=default#adminlink



Re: [CODE4LIB] Metadata war stories...

2012-01-26 Thread Johnston, Leslie
Roy's fabulous Bitter Harvest paper:  
http://roytennant.com/bitter_harvest.html 



[CODE4LIB] Metadata war stories...

2012-01-25 Thread Becky Yoose
Hi all,

For our preconference, “Digging into Metadata,” we’d like to get a little
discussion going to build on once the preconference rolls around.

A good part of our discussion will focus on metadata issues and how folks
have worked through said issues or have utilized metadata in a unique way
while keeping the metadata’s context in mind. Some examples include:

- Dirty data issues when switching discovery layers or using legacy/vendor
metadata (ex. HathiTrust)
- Dealing with free text in MARC records and how to parse them w/o too much
heartache
- Batch creating and editing metadata

Some of you have already touched on this in the last preconference email
thread, but we'd like to get some more examples to focus on. What are your
metadata war stories?

Thanks,
Becky

-
Becky Yoose
Systems Librarian
Grinnell College


Re: [CODE4LIB] Metadata war stories...

2012-01-25 Thread Derek Merleaux
I will contribute one particularly heartbreaking bit from my own
current metadata saga - I'm in one of these hybrid museum/research
library institutions where the library side has an aging MARC catalog
with its own issues that I won't go into at the moment. The museum
side has a commercial collection management database that recently
changed names from ReDiscovery to Proficio. The good news about this
database is that after some digging I uncovered an export method that
is fairly free-form and allows me to write a template to export
directly to MODS XML, which is my intended middle ground between
library and museum (the only trick is getting your hands on the Top
Sekrit database field names).

The bad - actually painful - news was discovering how data that had
been painstakingly entered by hand over 15 years into separate fields
was being munged together as free text within the database. Nobody knew
this was happening until I started trying to export data. So, for
example, a name and its associated role and dates would have been
entered into appropriate separate authority-controlled fields in a
data-entry form but then would be stuffed into a single field in the
database. The only consolation is that they do stuff in some text
delimiters that are (mostly) uncommon characters (pipes and
underscores), so it is possible to break the fields back out, just very
time consuming and prone to introducing errors.
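
A sketch of breaking one munged value back apart; the delimiter layout
(pipes between subfields, an underscore inside the date pair) is a
made-up example, not Proficio's actual scheme:

```ruby
# e.g. a name, role, and dates flattened into one database field
munged = 'Smith, Jane|sculptor|1901_1972'

name, role, dates = munged.split('|', 3)
born, died = dates.split('_', 2) if dates

p(name: name, role: role, born: born, died: died)
```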
Lesson learned: vigorously test how well the data comes out of any
system before investing any time putting data into it. Also invest in
time travel to go back and apply this lesson at the beginning...
-Derek
@dmer



Re: [CODE4LIB] Metadata war stories...

2012-01-25 Thread Kyle Banerjee
 For our preconference, “Digging into Metadata,” we’d like to get a little
 discussion going to build on once the preconference rolls around.

 ...
 - Dealing with free text in MARC records and how to parse them w/o too much
 heartache


You can find horrendous stories even with data that's fully structured.
Multiple libraries have had call numbers not migrated (or the wrong one
migrated, due to the unfortunate practice at most libraries of retaining
multiple call numbers) during an ILS migration -- as you can imagine, that
would make books much harder to find on the shelves. I can't remember the
names of institutions this happened to, but you could probably find someone
who can give you precise details on the autocat list.

There is the constant problem that in any migration, the data is not
structured/used the same way in the new system as in the old -- some fields
exist in one system but not the other, different numbers/types of fields
are used to represent concepts, etc.

I've personally encountered cases where the data that comes out of a system
is outright invalid or gets mangled in bizarre ways by the export routine
itself. For example, there's a system used for many digital archives that
splits a field in two anytime a field that needs to be represented by an
XML entity is encountered. Name withheld to protect the guilty.

kyle


Re: [CODE4LIB] Metadata war stories...

2012-01-25 Thread Walter Lewis
On 2012-01-25, at 10:06 AM, Becky Yoose wrote:

 - Dirty data issues when switching discovery layers or using legacy/vendor
 metadata (ex. HathiTrust)

I have a sharp recollection of a slide in a presentation Roy Tennant offered up 
at Access  (at Halifax, maybe), where he offered up a range of dates extracted 
from an array of OAI harvested records.  The good, the bad, the 
incomprehensible, the useless-without-context (01/02/03 anyone?) and on and on. 
 In my years of migrating data, I've seen most of those variants.  (except ones 
*intended* to be BCE).  
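
(A sketch of why 01/02/03 is useless without context: the same string
parses to three different dates under three equally common conventions.)

```ruby
require 'date'

s = '01/02/03'

puts Date.strptime(s, '%m/%d/%y')  # US style:         2003-01-02
puts Date.strptime(s, '%d/%m/%y')  # UK style:         2003-02-01
puts Date.strptime(s, '%y/%m/%d')  # year-first style: 2001-02-03
```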

Then there are the fielded data sets without authority control.  My favourite 
example comes from staff who nominally worked for me, so I'm not telling tales 
out of school.  The classic Dynix product had a Newspaper index module that we 
used before migrating it (PICK migrations; such a joy).  One title had twenty 
variations on "Georgetown Independent" (I wish I was kidding) and the dates 
ranged from the early ninth century until nearly the 3rd millennium. 
(Apparently there hasn't been much change in local council over the centuries.)

I've come to the point where I hand-walk the spatial metadata to links to 
geonames.org for the linked open data. Never had to do it for a set with more 
than 40,000 entries though.  The good news is that it isn't hard to establish a 
valid additional entry when one is required.
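
A sketch of the geonames.org lookup step, using the public searchJSON
API; the username and place name here are placeholders (GeoNames requires
a registered account for real use):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Return candidate GeoNames matches for a place name string.
def geonames_candidates(place, username: 'demo')
  uri = URI('http://api.geonames.org/searchJSON')
  uri.query = URI.encode_www_form(q: place, maxRows: 5, username: username)
  JSON.parse(Net::HTTP.get(uri)).fetch('geonames', [])
end

geonames_candidates('Georgetown, Ontario').each do |g|
  puts "#{g['name']}, #{g['adminName1']} -> http://sws.geonames.org/#{g['geonameId']}/"
end
```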

Walter

Re: [CODE4LIB] Metadata war stories...

2012-01-25 Thread Chris Fitzpatrick
I was part of a particularly long siege during the METS offensive back in '08. 
It was brutal. We pretty much ran out of everything and were fighting 
hand-to-hand before the whole thing was over.

I remember toward the end, while out on requirements-gathering patrol, my team 
came up on a group of rogue library staff who had separated from their 
cataloging unit. They were just sitting there, literally a few feet away, 
taking a chow break. We were heavily outnumbered and out-gunned, but it was a 
dark night, so I hoped we could just lie low and let them pass. But they 
started talking about how they were plotting a move to take out our dmdSec with 
some kind of RDF improvised explosive device. I knew this would set us back 
months and would result in a great loss of many of my fellow developers and 
librarians. So, I ordered my team into action…since we had surprise on our 
side, we were able to even the numbers by taking out several of their squad. 
Their manager ordered them to fall back and they retreated up a hill. Several of 
my team started whooping and hollering like we'd won something, but I knew they 
were just regrouping to hit back at us.

And, boy, did they ever hit back. We had a prolonged shootout.  I knew the 
longer this went, the more likely they'd be able to call in reinforcements or 
possibly get us with a Faculty-led napalm strike. So, I made the quick 
decision to charge their position. We bounded up the hill, taking cover behind 
trees, rocks, corpses, and whatever we could. We took heavy fire, but we got to 
the top. And that's when all hell broke loose. 

I've killed my fair share of people. In combat, you just learn to live with 
that. But there's something about strangling someone with your bare hands that 
just leaves a lasting impression. What happened on that hill comes back to me 
like nothing else. The screams and the faces and the smell. I talked to that 
doc and went to some ALA conferences, but whiskey seems to be the only thing 
that helps. 

They say we won that war, but most of the time I'm not sure we did….war's not 
over for me. It's never over. 


