Hi Anand,

On 22 February 2012 17:50, Anand Chitipothu <[email protected]> wrote:
>
> On 22-Feb-2012, at 6:00 PM, Ben Companjen wrote:
>
> Hi all,
>
> Last night I ran a script to count the identifiers found in Edition
> records in the dump of January 31st.
>
> It counted 173 identifiers, including ISBN 10 and 13, ocaid, oclc
> numbers and all the variations of the identifiers in the list in the
> edit form. There is a lot of junk in this list (starting with "1sbn",
> "Select", "isbn", "isbn13"..), but more effort is needed to find the
> records that contain the junk and clean it up. It appears that it
> contains classifications too - just like the edit form does?
>
> The CSV list is at https://gist.github.com/1884546 - the second column
> contains the total number of occurrences of the id (counting all the
> instances in each record), the third column is the number of records
> that contain the id.
>
>
> Hi Ben,
>
> Very interesting to see the stats of identifiers. We initially had an option
> for everyone to add new identifier and it has grown without any order. We've
> removed the ability to add new identifiers after realizing that it was going
> out of control.

I haven't been around since the beginning of Open Library, but I
understand that creates some kind of chaos ;)

>
> It will be nice if someone can write a bot to fix the existing identifiers.
> Will you be interested to write one?

I could give it a try, but I should do other things too sometime.
But before anyone can write a bot, it would be interesting to know
what it should do. What identifiers are preferred (e.g. "google" is
more popular than "google_books", but the latter may be clearer)?
More analysis is needed to see if values are compatible (e.g. "isbn"
-> "isbn_10" or "isbn_13"?). I'll see if I can do that analysis
first, then perhaps write a bot later.

Perhaps someone with more library knowledge can say some things about
these identifiers?

>
> I've sorted the identifiers on the total-occurrences count.
>
> https://gist.github.com/1885956#file_edition_identifiers_sorted_2012_01_31.csv
>
> What do you mean by "record occurrences"? Is that the number of records that
> have this identifier used? In that case it looks like that number of "ocaid"
> is wrong. We only allow one ocaid per edition and it should be exactly same
> as the total-occurrences. Can you check it once again?

Ah, I see what went wrong there: ocaid is the only identifier that is
not a list and I didn't check that before adding the len() to the
count. The number is this large because the average ocaid is 22.1
characters long :)

Ben

>
> Anand
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]
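
For illustration, here is a minimal sketch of the counting fix described above. It is not the script that was actually run. It assumes each Edition record has already been reduced to a dict mapping an identifier name to either a list of values or, as with ocaid, a single string. The function name `count_identifiers` and the sample records are hypothetical.

```python
from collections import Counter


def count_identifiers(editions):
    """Return (total occurrences, records containing) per identifier name.

    `editions` is assumed to be an iterable of dicts that map an
    identifier name to a list of values or to a single string.
    """
    total = Counter()    # every value, summed over all records
    records = Counter()  # records that hold the identifier at least once
    for edition in editions:
        for name, values in edition.items():
            # A bare string (e.g. ocaid) is one value. Calling len() on it
            # directly would count its characters instead.
            if isinstance(values, str):
                values = [values]
            total[name] += len(values)
            if values:
                records[name] += 1
    return total, records


# Made-up sample records, only to show the shape of the input.
sample = [
    {"ocaid": "exampleocaid00test", "isbn_10": ["0000000000", "1111111111"]},
    {"isbn_10": ["2222222222"], "google": []},
]
total, records = count_identifiers(sample)
```

With the string check in place, the total and per-record counts for "ocaid" agree whenever each record holds at most one ocaid, which is the consistency check Anand asked for.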
