Hi all,

In the most recent dump file, I found 7467 different values for the (physical) format field in the editions. Every variation in capitalization, punctuation, spacing, etc. counts as a separate value.
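To show what I mean by "variation", here is a minimal Python sketch of a fingerprint-style key, similar in spirit to key-collision clustering but not the actual Google Refine implementation. The names `fingerprint` and `cluster` and the sample values are my own assumptions for illustration.

```python
import string
from collections import defaultdict

# Translation table that deletes all ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def fingerprint(value: str) -> str:
    """Collapse case, punctuation and spacing into one comparison key."""
    lowered = value.strip().lower()
    no_punct = lowered.translate(_STRIP_PUNCT)
    # Sort and de-duplicate the tokens so word order does not matter.
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


if __name__ == "__main__":
    # Illustrative sample values, not taken from the dump.
    sample = ["Microform", "microform.", "[microform]", "Hardcover"]
    for key, variants in cluster(sample).items():
        print(key, "<-", variants)
```

A key like this only catches differences in case, punctuation, spacing and word order. Typos and genuinely wrong values would still need a fuzzier method or a manual look.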
Using Google Refine and its clustering functions, I brought that number down to about 4800, and saw that if my proposed changes were to be executed, about 1 million records would be changed. The majority of these changes involve variations of "microform".

I was wondering how others see this large number of "formats". Is it worth trying to "fix" them? Has anyone ever tried?

Some of the strange formats come from manual input; these are typos, spam and wrong inputs like ISBNs. Could more detailed instructions help prevent these? How about an autocomplete input field, like the one for language?

Part of the strange input comes from MARC records, like the ones from Library of Congress and Talis. Is it possible for the ImportBot to leave formats like ":" out, or autocorrect them? For example:

<http://openlibrary.org/query.json?type=/type/edition&physical_format=[Italian]%20/>
<http://openlibrary.org/query.json?type=/type/edition&physical_format=[chinese].>
<http://openlibrary.org/query.json?type=/type/edition&physical_format=:>
<http://openlibrary.org/query.json?type=/type/edition&physical_format=Paperback%20and%20Hardcover> (lazy people...)

I already corrected "[microwave]", "Both" and some other values.

Regards,
Ben
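P.S. To make the ImportBot question a bit more concrete, here is a rough Python sketch of the kind of check I have in mind. It is not ImportBot code and I do not know how the bot handles this field internally; `clean_format` and `ISBN_LIKE` are hypothetical names, and the rules (drop ISBN-looking values, trim stray punctuation and brackets) are only my assumptions about what might be useful.

```python
import re

# Hypothetical pattern: 10 characters (last may be X) or 13 digits,
# optionally separated by spaces or hyphens.
ISBN_LIKE = re.compile(r"^(?:\d[\s-]?){9}[\dXx]$|^(?:\d[\s-]?){13}$")


def clean_format(raw):
    """Return a tidied format string, or None if the value looks unusable."""
    value = raw.strip()
    if ISBN_LIKE.match(value):
        return None  # an ISBN was typed into the format field
    # Trim leading/trailing separators and brackets left over from records.
    value = value.strip(" \t:;,./[]")
    # An empty result means the value was punctuation only, e.g. ":".
    return value or None


if __name__ == "__main__":
    for raw in [":", "[chinese].", "[Italian] /", "0-306-40615-2", "Paperback"]:
        print(repr(raw), "->", repr(clean_format(raw)))
```

This only trims punctuation and rejects ISBN-like strings. A value such as "[Italian] /" would come out as "Italian", which is still not a format, so a list of accepted values would be needed on top of it.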
