Hi all,

In the most recent dump file, I found 7467 different values for the
(physical) format field in the editions. Every variation in
capitalization, punctuation, spacing etc. is counted.
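
For anyone who wants to reproduce the count, here is a rough sketch of the
tally. It assumes each line of the editions dump is tab-separated with the
edition's JSON record in the last column; that layout and the name
count_formats are my assumptions for illustration, so check them against
the dump you actually have.

```python
import json
from collections import Counter


def count_formats(lines):
    """Tally raw physical_format values, one per edition record.

    Assumes each line is tab-separated with the JSON record in the last
    column (an assumption about the dump layout, not a documented fact).
    """
    counts = Counter()
    for line in lines:
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        value = record.get("physical_format")
        # No normalization here: "Paperback" and "paperback" stay distinct.
        if isinstance(value, str):
            counts[value] += 1
    return counts
```

With a real dump, something like count_formats(open(path, encoding="utf-8"))
would give the tally, and len() of the result is the number of distinct
values.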

Using Google Refine and its clustering functions I brought that number
down to about 4800, and saw that if my proposed changes were to be
executed, about 1 million records would be changed. The majority of
these changes involve variations of "microform".
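
To give an idea of what the clustering does, here is a minimal sketch of a
key-collision "fingerprint" grouping in the spirit of Refine's default
method. It is not Refine's actual code. The function names are made up, and
picking the most frequent spelling as the target is my assumption about
what a reasonable merge would look like.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a raw value to a key that ignores case, punctuation,
    accents, token order and repeated tokens."""
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(counts):
    """Group raw values sharing a fingerprint.

    counts maps raw value -> number of records using it.
    Only groups with more than one spelling are returned.
    """
    groups = defaultdict(list)
    for value in counts:
        groups[fingerprint(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(vals) > 1}


def proposed_changes(counts):
    """Map each minority spelling to the most frequent spelling in its
    cluster (an assumed merge policy)."""
    changes = {}
    for values in cluster(counts).values():
        target = max(values, key=lambda v: counts[v])
        for value in values:
            if value != target:
                changes[value] = target
    return changes
```

Summing counts[v] over the keys of proposed_changes(counts) gives the
number of records a merge would touch.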

I was wondering how others see this large number of "formats". Is it
worth trying to "fix" them? Has anyone ever tried?
Some of the strange formats come from manual input; these are typos,
spam and wrong inputs like ISBNs. Could more detailed instructions
help prevent these? How about an autocomplete input field, like the
one for language?

Part of the strange input comes from MARC records, like the ones from
the Library of Congress and Talis. Is it possible for the ImportBot to
leave out formats like ":", or to autocorrect them?

For example:
<http://openlibrary.org/query.json?type=/type/edition&physical_format=[Italian]%20/>
<http://openlibrary.org/query.json?type=/type/edition&physical_format=[chinese].>
<http://openlibrary.org/query.json?type=/type/edition&physical_format=:>
<http://openlibrary.org/query.json?type=/type/edition&physical_format=Paperback%20and%20Hardcover>
(lazy people...)
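
As a rough idea of what an import-time cleanup could look like, here is a
small sketch. The function clean_physical_format is hypothetical, I have
not looked at how the ImportBot handles this field, and the rules are only
the ones suggested by the examples above.

```python
def clean_physical_format(raw):
    """Return a tidied format string, or None if nothing usable remains.

    Hypothetical helper: the rules are guesses based on a few examples.
    """
    if raw is None:
        return None
    # Drop surrounding whitespace and stray separator punctuation.
    text = raw.strip().strip(" /:;,.")
    # Remove the square brackets that come along from cataloguing notes.
    text = text.replace("[", "").replace("]", "").strip()
    # A value with no letters or digits, such as ":", carries no information.
    if not any(ch.isalnum() for ch in text):
        return None
    return text
```

This only strips the noise. It turns "[Italian] /" into "Italian", but it
cannot tell that a language name is not a physical format; that would still
need a list of accepted values or manual review.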

I already corrected "[microwave]", "Both" and some other values.

Regards,

Ben
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
