Hi Karen, all,

On 13 April 2012 18:32, Karen Coyle <[email protected]> wrote:
> Ooof! I took a look at some of these and mostly they are badly input $h
> subfields from the 245 field, and the ones I saw were from Talis. (That
> Talis data will haunt us forever -- very dirty.)
I saw a record from the Library of Congress* with ":" in 245$h, and Oregon
Libraries has at least a few MARC records that have the byline in the $h
subfield, so Talis is not the only dirty data producer/distributor ;)

And on import, most of the punctuation marks like [, ], : and / can be
stripped, I think. There are 376566 records with "[microform] :" in the
latest datadump, whereas there were 376323 in January's datadump. See most
variants (with proposed normalization "Microform") in this huge table:
http://companjen.name/ol/editions_formats-2012-01-31.html

> I think that a
> pull-down would be a good idea for manual input.

"Both" as format could be an answer to the question "What sort of book is
it? Paperback; Hardcover, etc.", the label of the field. So perhaps some
more help is good, although I assume Open Library will contain book+CD
combinations etc. The form must still support that, of course.

> From relatively good
> MARC records the list of terms should be quite short although a little
> creative input does take place. The valid terms in MARC (which are
> called General Material Designations) are:
>
> http://www.uproc.lib.mi.us/cat/gmd.htm

Is it true that Paperback and Hardcover are not on the MARC list of GMDs or
in RDA's lists of content/carrier/material types? I guess these are
"concepts" under "text", but since there are 3M+ Paperbacks and 1.5M
Hardcovers in OL, I was a little surprised to not find them.
BTW: sorted by number of records, "Unknown binding" is in third place and
"Audio CD" in seventh.

>
> Unfortunately, libraries do catalog the paperback and hardcopy on the
> same record. However, those terms are not acceptable GMDs. They *do*,
> however, result in more than one ISBN coming in on a single record.

Does Open Library say anywhere that paperbacks and hardcovers should be
separate editions? I consider them different, but I get the feeling newly
published authors who add their own books don't (seem to) care, or maybe
just don't know.
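Coming back to stripping punctuation on import: here is a rough Python
sketch of what I have in mind. It is only an illustration, not how ImportBot
works today. The names normalize_format and cluster_formats are made up, and
the set of characters to drop is just the ones I mentioned above ([, ], :
and /), plus leading/trailing dots and spaces.

```python
import re
from collections import defaultdict

# Assumption: these characters only leak in from MARC 245$h punctuation
# and carry no meaning in the format field.
_MARC_PUNCTUATION = re.compile(r"[\[\]:/]")
_WHITESPACE = re.compile(r"\s+")


def normalize_format(raw):
    """Return a cleaned-up format string, or None if nothing is left."""
    cleaned = _MARC_PUNCTUATION.sub(" ", raw)
    cleaned = _WHITESPACE.sub(" ", cleaned).strip(" .")
    if not cleaned:
        return None  # e.g. a bare ":" should not become a format at all
    # Capitalize only the first letter so "Audio CD" keeps its casing.
    return cleaned[0].upper() + cleaned[1:]


def cluster_formats(values):
    """Group raw values by their case-insensitive normalized form."""
    clusters = defaultdict(list)
    for raw in values:
        normalized = normalize_format(raw)
        if normalized is not None:
            clusters[normalized.lower()].append(raw)
    return dict(clusters)


if __name__ == "__main__":
    samples = ["[microform] :", "Microform", ":", "[chinese].", "Audio CD"]
    for key, variants in cluster_formats(samples).items():
        print(key, variants)
```

This is much cruder than the clustering in Google Refine: it would not catch
typos such as "[microwave]", and values like "[Italian] /" still come out as
"Italian", which is a language rather than a format. Those would need a
separate check against a list of accepted terms.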
>
> kc
>

Ben

* I'm still waiting for an email from the LC explaining why they put "[B]"
in the Dewey Decimal Classification field. :)

> On 4/11/12 4:00 PM, Ben Companjen wrote:
>> Hi all,
>>
>> In the most recent dump file, I found 7467 different values for the
>> (physical) format field in the editions. Every variation in
>> capitalization, punctuation, spacing etc. is counted.
>>
>> Using Google Refine and its clustering functions I brought that number
>> down to about 4800, and saw that if my proposed changes were to be
>> executed, about 1 million records would be changed. The majority of
>> these changes involve variations of "microform".
>>
>> I was wondering how others see this large number of "formats". Is it
>> worth trying to "fix" them? Has anyone ever tried?
>> Some of the strange formats come from manual input; these are typos,
>> spam and wrong inputs like ISBNs. Could more detailed instructions
>> help prevent these? How about an autocomplete input field, like the
>> one for language?
>>
>> Part of the strange input comes from MARC records, like the ones from
>> Library of Congress and Talis. Is it possible for the ImportBot to
>> leave formats like ":" out, or autocorrect it?
>>
>> For example:
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=[Italian]%20/>
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=[chinese].>
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=:>
>> <http://openlibrary.org/query.json?type=/type/edition&physical_format=Paperback%20and%20Hardcover>
>> (lazy people...)
>>
>> I already corrected "[microwave]", "Both" and some other values.
>>
>> Regards,
>>
>> Ben
>> _______________________________________________
>> Ol-tech mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>> To unsubscribe from this mailing list, send email to
>> [email protected]
>
> --
> Karen Coyle
> [email protected] http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet
