Someone asked me off-list what types of OpenLibrary data cleanups I'd suggest. Below is the list that I came up with off the top of my head. What other cleanups would folks suggest, and which do you think are most important?
Possible data cleanup targets:

- kill historical spam
- identify spammers & spam quickly
- normalize author names (many are still in last, first form while the OL standard is first last)
- normalize all strings to Unicode NFC and/or just search versions to NFKC
- merge duplicate authors (user-driven merges are currently disabled due to vandalism & difficulty of undoing bad merges)
- merge duplicate works
- create works for editions with no works (one instance of a variety of different data consistency issues which have crept in over time)
- add links for authors & works to Wikipedia, Wikidata, Freebase, IMDB, NNDB, MusicBrainz, Project Gutenberg, GoodReads, etc
- rank electronic editions by quality (OCR quality varies wildly)
- clean OCR'd texts (actually an IA task, not OL)

Tom
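
P.S. To make the two normalization items concrete, here is a rough sketch in Python. The function names are made up for illustration and none of this is existing OL code. It assumes stored strings get NFC, search keys get NFKC plus case folding, and that an author name is only flipped when it contains exactly one comma, so anything ambiguous (dates, suffixes like "Jr.") is left for a human.

```python
import unicodedata


def normalize_display(s: str) -> str:
    """NFC: compose characters without changing how they look (hypothetical helper)."""
    return unicodedata.normalize("NFC", s)


def normalize_search(s: str) -> str:
    """NFKC plus casefold: fold compatibility forms and case for a search key."""
    return unicodedata.normalize("NFKC", s).casefold()


def flip_author_name(name: str) -> str:
    """Turn 'Last, First' into 'First Last'.

    Only handles the unambiguous case of exactly one comma with text on
    both sides; everything else is returned unchanged.
    """
    parts = [p.strip() for p in name.split(",")]
    if len(parts) == 2 and all(parts):
        last, first = parts
        return f"{first} {last}"
    return name
```

For example, flip_author_name("Twain, Mark") gives "Mark Twain", while "King, Martin Luther, Jr." comes back untouched. Corporate or joint names that happen to contain a single comma would still be flipped wrongly, so a real cleanup would want a review step.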
