Hi Karen, all,

On 13 April 2012 18:32, Karen Coyle <[email protected]> wrote:
> Ooof! I took a look at some of these and mostly they are badly input $h
> subfields from the 245 field, and the ones I saw were from Talis. (That
> Talis data will haunt us forever -- very dirty.)

I saw a record from the Library of Congress* with ":" in 245$h, and
Oregon Libraries has at least a few MARC records that have the byline
in the $h subfield, so Talis is not the only dirty data
producer/distributor ;)

And on import, most of the punctuation marks like [, ], : and / can be
stripped, I think. There are 376566 records with "[microform] :" in the
latest data dump, whereas there were 376323 in January's data dump. See
most variants (with proposed normalization "Microform") in this huge
table: http://companjen.name/ol/editions_formats-2012-01-31.html
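
A rough sketch of the stripping I have in mind, in Python. The helper
name normalize_format and the exact set of punctuation are my own
assumptions; this is not what ImportBot does today, and real data will
need more cases than this:

```python
import re

# Hypothetical helper, not part of Open Library's import code.
# Assumed set of punctuation/whitespace that can surround a 245$h value.
_EDGE_PUNCTUATION = " \t[]:/.;,="


def normalize_format(raw):
    """Return a cleaned-up format string, or None if nothing is left."""
    # Drop square brackets wherever they occur.
    text = raw.replace("[", "").replace("]", "")
    # Strip leftover punctuation and whitespace from both ends.
    text = text.strip(_EDGE_PUNCTUATION)
    # Collapse internal runs of whitespace to a single space.
    text = re.sub(r"\s+", " ", text)
    if not text:
        # Values such as ":" carry no format information at all.
        return None
    # Capitalize only the first letter and leave the rest untouched.
    return text[0].upper() + text[1:]
```

With this, "[microform] :" would become "Microform", and a bare ":"
would be dropped instead of being stored as a format.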

> I think that a
> pull-down would be a good idea for manual input.
"Both" as a format could be an answer to the question "What sort of book
is it? Paperback; Hardcover, etc.", the label of the field. So perhaps
some more help text would be good, although I assume Open Library will
contain book+CD combinations etc. The form must still support that, of
course.

> From relatively good
> MARC records the list of terms should be quite short although a little
> creative input does take place. The valid terms in MARC (which are
> called General Material Designations) are:
>
> http://www.uproc.lib.mi.us/cat/gmd.htm

Is it true that Paperback and Hardcover are not on the MARC list of
GMDs or in RDA's lists of content/carrier/material types?
I guess these are "concepts" under "text", but since there are 3M+
Paperbacks and 1.5M Hardcovers in OL, I was a little surprised to not
find them.
BTW: sorted by number of records, "Unknown binding" is in third place
and "Audio CD" in seventh.
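
For anyone who wants to reproduce such a ranking, here is a minimal
sketch in Python. It assumes each line of the editions dump is
tab-separated with the JSON record in the last column; the function name
count_formats is made up for illustration:

```python
import json
from collections import Counter


def count_formats(lines):
    """Count physical_format values in an iterable of dump lines."""
    counts = Counter()
    for line in lines:
        # Assumption: tab-separated columns, JSON record in the last one.
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        value = record.get("physical_format")
        if value is not None:
            # Every spelling variant is counted as its own value.
            counts[value] += 1
    return counts
```

Calling count_formats(open(path)).most_common(10) on the dump file (path
being wherever you saved it) would list the ten most frequent values.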

>
> Unfortunately, libraries do catalog the paperback and hardcopy on the
> same record. However, those terms are not acceptable GMDs. They *do*,
> however, result in more than one ISBN coming in on a single record.

Does Open Library say anywhere that paperbacks and hardcovers should
be separate editions? I consider them different, but I get the feeling
newly published authors who add their own books don't (seem to) care,
or maybe just don't know.

>
> kc
>

Ben

* I'm still waiting for an email from the LC explaining why they put
"[B]" in the Dewey Decimal Classification field. :)

> On 4/11/12 4:00 PM, Ben Companjen wrote:
>> Hi all,
>>
>> In the most recent dump file, I found 7467 different values for the
>> (physical) format field in the editions. Every variation in
>> capitalization, punctuation, spacing etc. is counted.
>>
>> Using Google Refine and its clustering functions I brought that number
>> down to about 4800, and saw that if my proposed changes were to be
>> executed, about 1 million records would be changed. The majority of
>> these changes involve variations of "microform".
>>
>> I was wondering how others see this large number of "formats". Is it
>> worth trying to "fix" them? Has anyone ever tried?
>> Some of the strange formats come from manual input; these are typos,
>> spam and wrong inputs like ISBNs. Could more detailed instructions
>> help prevent these? How about an autocomplete input field, like the
>> one for language?
>>
>> Part of the strange input comes from MARC records, like the ones from
>> Library of Congress and Talis. Is it possible for the ImportBot to
>> leave formats like ":" out, or autocorrect it?
>>
>> For example:
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=[Italian]%20/>
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=[chinese].>
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=:>
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=Paperback%20and%20Hardcover>
>> (lazy people...)
>>
>> I already corrected "[microwave]", "Both" and some other values.
>>
>> Regards,
>>
>> Ben
>> _______________________________________________
>> Ol-tech mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>> To unsubscribe from this mailing list, send email to 
>> [email protected]
>
> --
> Karen Coyle
> [email protected] http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet
