I just finished counting all keys in January's dump. I added the
two-column (key, records) CSV files to the same Gist:
https://gist.github.com/1884546

It's kind of messy: it looks like fields from earlier schema designs
were never migrated to their replacements in the newer schemas (e.g.
the field "coverimage" is deprecated in favor of "covers", but the old
data is still there).

On 22 February 2012 16:55, Karen Coyle <[email protected]> wrote:
> Ben, where did the strings like "amazon.co.jp" come from? Did you grab
> the domain names? Or were these all text strings found in the field?
>
> kc

I ran a script that keeps a count of every key it finds in the
identifiers object, so all of these strings were indeed found in the
field. I have no idea how they got into Open Library, but the dropdown
list from which you choose an identifier key when you manually edit a
book contains flaws and obvious duplicates (e.g.
bibliothèque_nationale_de_france and
bibliothèque_nationale_de_france_(bnf)), which may have contributed to
this mess. It looks like this "controlled vocabulary" could use some
documentation too :)
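
For anyone who wants to reproduce the counts, here is a rough sketch of
that kind of counting script. It is not the actual script; it assumes
the dump is tab-separated with the record type in the first column and
the JSON record in the fifth, and that "identifiers" maps each key to a
list of values. The function names are made up for illustration.

```python
import csv
import json
from collections import Counter


def count_identifier_keys(lines):
    """Count identifier keys in edition records.

    Returns two Counters: total values per key, and records per key.
    """
    occurrences = Counter()  # all values seen for a key, across records
    records = Counter()      # number of records that contain the key
    for line in lines:
        # Assumed dump layout: type, key, revision, last_modified, JSON
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 5 or parts[0] != "/type/edition":
            continue
        try:
            record = json.loads(parts[4])
        except ValueError:
            continue  # skip lines whose JSON cannot be parsed
        identifiers = record.get("identifiers")
        if not isinstance(identifiers, dict):
            continue
        for name, values in identifiers.items():
            # Assumption: values is normally a list; treat anything else as one value
            occurrences[name] += len(values) if isinstance(values, list) else 1
            records[name] += 1
    return occurrences, records


def write_csv(occurrences, records, out):
    """Write rows of (key, total occurrences, records containing the key)."""
    writer = csv.writer(out)
    for name, _ in records.most_common():
        writer.writerow([name, occurrences[name], records[name]])
```

Calling count_identifier_keys() on an open dump file and passing the
result to write_csv() with sys.stdout would print one row per key, so
junk keys like "1sbn" show up next to the real ones.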

Ben

>
> On 2/22/12 4:30 AM, Ben Companjen wrote:
>> Hi all,
>>
>> Last night I ran a script to count the identifiers found in Edition
>> records in the dump of January 31st.
>>
>> It counted 173 identifiers, including ISBN 10 and 13, ocaid, oclc
>> numbers and all the variations of the identifiers in the list in the
>> edit form. There is a lot of junk in this list (starting with "1sbn",
>> "Select", "isbn", "isbn13"..), but more effort is needed to find the
>> records that contain the junk and clean it up. It appears that it
>> contains classifications too - just like the edit form does?
>>
>> The CSV list is at https://gist.github.com/1884546 - the second column
>> contains the total number of occurrences of the id (counting all the
>> instances in each record), the third column is the number of records
>> that contain the id.
>>
>> Regards,
>>
>> Ben
>> _______________________________________________
>> Ol-tech mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>> To unsubscribe from this mailing list, send email to 
>> [email protected]
>
> --
> Karen Coyle
> [email protected] http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet
