On Wed, Apr 11, 2012 at 7:00 PM, Ben Companjen <[email protected]> wrote:

> In the most recent dump file, I found 7467 different values for the
> (physical) format field in the editions. Every variation in
> capitalization, punctuation, spacing etc. is counted.
>
> Using Google Refine and its clustering functions I brought that number
> down to about 4800, and saw that if my proposed changes were to be
> executed, about 1 million records would be changed. The majority of
> these changes involve variations of "microform".
>
> I was wondering how others see this large number of "formats". Is it
> worth trying to "fix" them? Has anyone ever tried?

I'm more concerned with duplicate (or worse, conflated) authors than a
little cruft in a field that most people don't use, but if these are
going to be fixed, we should make sure we don't lose data.
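
To illustrate what a non-destructive cleanup could look like, here is a
minimal sketch of key-collision clustering that keeps every original
value next to its normalized key, so nothing is overwritten. It is only
a simplified approximation of the idea behind Refine's fingerprint
method, not its actual implementation; the function names and sample
values are made up for the example.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trim, lowercase, drop punctuation,
    then sort and de-duplicate the whitespace-separated tokens."""
    cleaned = value.strip().lower()
    cleaned = re.sub(r"[^\w\s]", "", cleaned)
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group original values by fingerprint without modifying them."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return dict(clusters)


# Hypothetical sample values, not taken from the dump.
samples = ["Microform", "microform.", " [microform] ", "Paperback", "paperback "]
clusters = cluster(samples)

for key, originals in clusters.items():
    print(repr(key), "<-", originals)
```

Because the originals are retained per cluster, any extra text that
ended up in the field can be reviewed before a change is applied.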

Many of the [microform] "formats" appear to be the result of bad
parses and include other information such as subtitles, contribution
statements, etc.  Is this data duplicated elsewhere in the record
already or does it need to be moved as part of this process?

Also, there appear to be some character encoding issues, but it's
unclear whether it's just in the web page with the listing or in the
original source data.  Typically this is the result of UTF-8 encoded
data being interpreted as ISO Latin-1.
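
A small sketch of that failure mode, and of a cautious way to undo it,
assuming the damage really is UTF-8 bytes decoded as Latin-1 (the helper
name and the sample string are hypothetical):

```python
def repair_mojibake(text):
    """Try to reverse UTF-8-read-as-Latin-1 damage.

    If the text cannot be round-tripped this way, it is returned
    unchanged rather than guessed at.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Bibliothèque"

# Simulate the bug: UTF-8 bytes interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Reverse it, and check that clean text is left alone.
repaired = repair_mojibake(garbled)
print(repaired)
print(repair_mojibake(original))
```

Running something like this against the raw dump rather than the web
listing would show where the problem actually lives.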

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
