Someone asked me off-list what types of OpenLibrary data cleanups I'd
suggest.  Below is the list that I came up with off the top of my head.
What others would folks suggest?  Which do you think are most important?

Possible data cleanup targets:
- kill historical spam
- identify spammers & spam quickly
- normalize author names (many are still in last, first form while the OL
standard is first last)
- normalize all strings to Unicode NFC and/or normalize just the search
versions to NFKC
- merge duplicate authors (user-driven merges are currently disabled due to
vandalism & difficulty of undoing bad merges)
- merge duplicate works
- create works for editions with no works (one instance of a variety of
different data consistency issues which have crept in over time)
- add links for authors & works to Wikipedia, Wikidata, Freebase, IMDB,
NNDB, MusicBrainz, Project Gutenberg, GoodReads, etc.
- rank electronic editions by quality (OCR quality varies wildly)
- clean OCR'd texts (actually an IA task, not OL)
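
To make the two normalization items above a bit more concrete, here is a
rough Python sketch.  It is only an illustration, not how OL does or should
implement it: the function names are made up, and the name-flipping rule
(exactly one comma, both halves non-empty, no digits) is my own conservative
assumption.  It will still get some cases wrong, e.g. corporate names that
happen to contain a single comma, so anything like this would need review
before being run against real records.

```python
import unicodedata


def normalize_stored(text):
    """Canonical composition (NFC) for the strings we actually store."""
    return unicodedata.normalize("NFC", text)


def normalize_for_search(text):
    """Compatibility composition (NFKC) for search copies only.

    NFKC folds ligatures, full-width forms, etc., so it is lossy and
    probably should not replace the stored value.
    """
    return unicodedata.normalize("NFKC", text)


def flip_author_name(name):
    """Turn 'Last, First' into 'First Last' in the simplest case only.

    Hypothetical rule: exactly one comma, both halves non-empty, and no
    digits (which usually mean dates were appended).  Anything else is
    returned unchanged for a human to look at.
    """
    parts = [p.strip() for p in name.split(",")]
    if len(parts) != 2 or not all(parts):
        return name
    if any(ch.isdigit() for ch in name):
        return name
    last, first = parts
    return f"{first} {last}"
```

For example, flip_author_name("Twain, Mark") gives "Mark Twain", while
"King, Martin Luther, Jr." and "Twain, Mark, 1835-1910" are left alone.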

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/