On 16 April 2012 13:43, Tom Morris <[email protected]> wrote:
> On Wed, Apr 11, 2012 at 7:00 PM, Ben Companjen <[email protected]> wrote:
>
>> In the most recent dump file, I found 7467 different values for the
>> (physical) format field in the editions. Every variation in
>> capitalization, punctuation, spacing etc. is counted.
>>
>> Using Google Refine and its clustering functions I brought that number
>> down to about 4800, and saw that if my proposed changes were to be
>> executed, about 1 million records would be changed. The majority of
>> these changes involve variations of "microform".
>>
>> I was wondering how others see this large number of "formats". Is it
>> worth trying to "fix" them? Has anyone ever tried?
>
> I'm more concerned with duplicate (or worse, conflated) authors than a
> little cruft in a field that most people don't use, but if these are
> going to be fixed, we should make sure we don't lose data.
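A note on the clustering step quoted above: Google Refine's key-collision clustering groups values whose normalized "fingerprint" is identical. The Python sketch below approximates that idea and is not Refine's actual code; the names `fingerprint` and `cluster` and the sample values are made up for illustration. It only reports candidate groups and rewrites nothing, in keeping with the concern about not losing data.

```python
import string
import unicodedata
from collections import defaultdict

# Translation table that deletes ASCII punctuation such as [ ] . ,
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def fingerprint(value):
    """Return a normalized key: trimmed, lowercased, accents and ASCII
    punctuation removed, tokens de-duplicated and sorted."""
    s = unicodedata.normalize("NFKD", value.strip().lower())
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = s.translate(_STRIP_PUNCT)
    return " ".join(sorted(set(s.split())))


def cluster(values):
    """Group distinct raw values that share a fingerprint.
    Only groups with more than one variant are returned."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}


# Hypothetical sample of "format" values, not taken from the dump.
sample = ["Microform", "[microform]", "microform.", " Microform ",
          "Paperback", "paperback", "Hardcover"]
clusters = cluster(sample)
for key, variants in sorted(clusters.items()):
    print(key, variants)
```

Each reported group still needs a human decision on the canonical value, and values that carry extra data (bylines, dimensions, languages) will not collapse this way, which is probably a good thing.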

I agree on not losing data. For some of the "formats" I have tried moving
the byline, dimensions and language to separate columns, but more
(manual?) checking is needed before the non-trivial items on the list
can be replaced.

> Many of the [microform] "formats" appear to be the result of bad
> parses and include other information such as subtitles, contribution
> statements, etc. Is this data duplicated elsewhere in the record
> already or does it need to be moved as part of this process?

I checked some of the languages (e.g. "[chinese].") and found the
language already indicated on the records, but I don't know whether that
holds for all languages. The byline is usually not in the OL records
when it was part of the $h subfield.

> Also, there appear to be some character encoding issues, but it's
> unclear whether it's just in the web page with the listing or in the
> original source data. Typically this is the result of UTF-8 encoded
> data being interpreted as ISO Latin-1.

I added the BOM to the webpage, so it may be better now.

> Tom

Ben

> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]
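On the encoding point raised in the thread (UTF-8 data being interpreted as ISO Latin-1): if the garbling turns out to be in the data rather than only in the web page, it can often be reversed by re-encoding the text as Latin-1 and decoding the bytes as UTF-8. A minimal Python sketch, assuming strict Latin-1 was the wrong codec (in practice it is sometimes Windows-1252 instead); `repair_mojibake` is a hypothetical helper name:

```python
def repair_mojibake(text):
    """Undo 'UTF-8 bytes read as Latin-1' garbling where possible.

    If the text cannot be encoded as Latin-1, or the resulting bytes are
    not valid UTF-8, it was probably not garbled this way, so it is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the problem: UTF-8 bytes decoded with the wrong codec.
garbled = "café".encode("utf-8").decode("latin-1")
print(repr(garbled))
print(repr(repair_mojibake(garbled)))
```

Correctly decoded text containing accented Latin-1 characters normally fails the UTF-8 decode and is left alone, but a string that happens to form valid UTF-8 byte sequences would be altered, so it seems safer to review the changes than to apply this blindly to a whole dump.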
