Re: [ol-tech] Counting identifiers in Editions

Ben Companjen Wed, 22 Feb 2012 12:43:56 -0800

Hi Anand,

On 22 February 2012 17:50, Anand Chitipothu <[email protected]> wrote:
>
> On 22-Feb-2012, at 6:00 PM, Ben Companjen wrote:
>
> Hi all,
>
> Last night I ran a script to count the identifiers found in Edition
> records in the dump of January 31st.
>
> It counted 173 identifiers, including ISBN 10 and 13, ocaid, oclc
> numbers and all the variations of the identifiers in the list in the
> edit form. There is a lot of junk in this list (starting with "1sbn",
> "Select", "isbn", "isbn13"..), but more effort is needed to find the
> records that contain the junk and clean it up. It appears that it
> contains classifications too - just like the edit form does?
>
> The CSV list is at https://gist.github.com/1884546 - the second column
> contains the total number of occurrences of the id (counting all the
> instances in each record), the third column is the number of records
> that contain the id.
>
>
> Hi Ben,
>
> Very interesting to see the stats of identifiers. We initially had an option
> for everyone to add new identifiers, and the list has grown without any
> order. We've removed the ability to add new identifiers after realizing that
> it was going out of control.


I haven't been around since the beginning of Open Library, but I
understand that creates some kind of chaos ;)
>
> It would be nice if someone could write a bot to fix the existing identifiers.
> Would you be interested in writing one?

I could give it a try, but I should do other things too sometime. But
before anyone can write a bot, it would be interesting to know what it
should do. What identifiers are preferred (e.g. "google" is more
popular than "google_books", but the latter may be clearer)? More
analysis is needed to see if values are compatible (e.g. "isbn" ->
"isbn_10" or "isbn_13"?). I'll see if I can do that analysis first,
then perhaps write a bot later.
Perhaps someone with more library knowledge can say some things about
these identifiers?
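
A minimal sketch of what such a compatibility check could look like for
the "isbn" case, assuming the values are plain strings that may contain
hyphens or spaces. The function name classify_isbn is hypothetical, and
the check only looks at length and character set; it does not verify
check digits.

```python
def classify_isbn(value):
    """Guess whether a generic "isbn" value fits isbn_10 or isbn_13.

    Returns "isbn_10", "isbn_13", or None if the value fits neither.
    Only length and allowed characters are checked, not the check digit.
    """
    # Drop the separators people commonly type into ISBN fields.
    digits = value.replace("-", "").replace(" ", "").upper()
    # ISBN-10: nine digits followed by a digit or "X".
    if (len(digits) == 10 and digits[:9].isdigit()
            and (digits[9].isdigit() or digits[9] == "X")):
        return "isbn_10"
    # ISBN-13: thirteen digits.
    if len(digits) == 13 and digits.isdigit():
        return "isbn_13"
    return None
```

Values that come back as None would need a human look rather than an
automatic move to another key.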
>
> I've sorted the identifiers on the total-occurrences count.
>
> https://gist.github.com/1885956#file_edition_identifiers_sorted_2012_01_31.csv
>
> What do you mean by "record occurrences"? Is that the number of records that
> use this identifier? In that case it looks like the number for "ocaid"
> is wrong. We only allow one ocaid per edition, so it should be exactly the
> same as the total-occurrences count. Can you check it once again?

Ah, I see what went wrong there: ocaid is the only identifier that is
stored as a single string rather than a list, and I didn't check for
that before adding len() of the value to the count. For a string, len()
counts characters, so the number is this large because the average ocaid
is 22.1 characters long :)
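
A corrected count could look roughly like the sketch below. It is not
the script I ran; it assumes each edition has already been reduced to a
dict that maps an identifier name to either a single string (as with
ocaid) or a list of strings, and the name count_identifiers is made up
for this example.

```python
from collections import Counter


def count_identifiers(editions):
    """Return (total occurrences, records containing) per identifier name."""
    totals = Counter()
    records = Counter()
    for edition in editions:
        for name, value in edition.items():
            # A bare string such as ocaid is one value; calling len() on it
            # directly would count its characters instead.
            values = [value] if isinstance(value, str) else list(value)
            if not values:
                continue
            totals[name] += len(values)
            records[name] += 1
    return totals, records
```

With this, an identifier that can occur only once per record ends up
with the same number in both columns.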

Ben
>
> Anand
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]
>
