On 16 April 2012 13:43, Tom Morris <[email protected]> wrote:
> On Wed, Apr 11, 2012 at 7:00 PM, Ben Companjen <[email protected]> wrote:
>
>> In the most recent dump file, I found 7467 different values for the
>> (physical) format field in the editions. Every variation in
>> capitalization, punctuation, spacing etc. is counted.
>>
>> Using Google Refine and its clustering functions I brought that number
>> down to about 4800, and saw that if my proposed changes were to be
>> executed, about 1 million records would be changed. The majority of
>> these changes involve variations of "microform".
>>
>> I was wondering how others see this large number of "formats". Is it
>> worth trying to "fix" them? Has anyone ever tried?
>
> I'm more concerned with duplicate (or worse, conflated) authors than a
> little cruft in a field that most people don't use, but if these are
> going to be fixed, we should make sure we don't lose data.

I agree on not losing data. For some of the "formats" I've tried moving
the byline, dimensions and language into separate columns, but more
(manual?) checking is needed before the non-trivial items on the list
can be replaced.
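
For illustration, here is a minimal Python sketch of key-collision
clustering, loosely modeled on the "fingerprint" method in Google Refine.
It is not Refine's actual implementation (which, for example, also
normalizes accented characters), and the function names and sample values
are made up for this example. Each cluster keeps every original spelling,
so nothing is lost before a manual review.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a clustering key: trim, lowercase, drop ASCII punctuation,
    then sort the unique whitespace-separated tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster_formats(values):
    """Group raw format strings by fingerprint, keeping the originals."""
    clusters = defaultdict(set)
    for value in values:
        clusters[fingerprint(value)].add(value)
    return dict(clusters)


# Hypothetical sample values, not taken from the dump.
sample = ["Microform", "[microform]", "microform.", "Paperback", "paperback "]
clusters = cluster_formats(sample)
for key, originals in sorted(clusters.items()):
    print(key, "<-", sorted(originals))
```

Values that carry extra data (bylines, dimensions, languages) get their
own fingerprints and so stay out of the simple clusters, which leaves
them for the manual check.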
>
> Many of the [microform] "formats" appear to be the result of bad
> parses and include other information such as subtitles, contribution
> statements, etc.  Is this data duplicated elsewhere in the record
> already or does it need to be moved as part of this process?

I checked for some of the languages (e.g. "[chinese].") and found the
language already indicated on the records, but I don't know if that
holds for all languages. The byline is usually missing from the OL
records when it was part of the $h subfield.
>
> Also, there appear to be some character encoding issues, but it's
> unclear whether it's just in the web page with the listing or in the
> original source data.  Typically this is the result of UTF-8 encoded
> data being interpreted as ISO Latin-1.

I added the BOM to the webpage, so it may be better now.
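
To check whether the source data itself is affected, the
misinterpretation Tom describes can be reproduced and reversed in a few
lines of Python. This is only a sketch under the assumption that the
damage is a single round of UTF-8 bytes read as ISO Latin-1; the function
name and the sample string are made up for this example.

```python
def repair_mojibake(text):
    """Return the repaired string if `text` looks like UTF-8 that was
    decoded as Latin-1; return None if no such repair applies."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes
        # are not valid UTF-8, so this kind of damage is not present.
        return None
    return repaired if repaired != text else None


# Reproduce the problem with a hypothetical value.
original = "édition"
garbled = original.encode("utf-8").decode("latin-1")

print(garbled)                    # the garbled form seen on a bad page
print(repair_mojibake(garbled))   # the repaired value
print(repair_mojibake(original))  # None: already correct
```

If values from the dump come back as None while the web page looks
garbled, the problem is more likely in how the page is served than in the
data.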
>
> Tom

Ben
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to 
> [email protected]
