Hi Alan,

On 11/24/10 10:29 AM, Alan Millar wrote:
> On Wed, Nov 24, 2010 at 10:05 AM, Karen Coyle<kco...@kcoyle.net> wrote:
>> It might be necessary to drop them out of the Amazon data gathering,
>> although it would be a shame because they also contribute some of the
>> "long tail" books to the database. I wonder if it wouldn't at least be
>> possible to drop all of the instances of
>
> Personally, I don't think we should automate dropping them; it is good
> metadata. Rather, I think we should automate moving it into the
> additional people list. The trick will be coming up with some
> judicious pattern matching smarts.

That's right. It's surprisingly hard to catch all the permutations of
what you perceive to be a pattern.

> (But here is another fun one that probably should be just dropped:
> http://openlibrary.org/search/authors?q=from+old+catalog
> :-)

In this example, there's variation in the characters which surround the
"from old catalog" statement. Sometimes [], sometimes (), etc.

> I see quite a few cases where useful metadata could be moved from one
> field to another. Things such as book titles with series or edition
> suffixes like "(Great Classics Series)" or
> http://openlibrary.org/search?q=large+print+edition
> etc. These follow fairly regular patterns, so it could be automated
> with supervision.

Absolutely. I've noticed you having a shot with "large print" and, given
the frequency, it looks automated... is that right? (Super awesome!!)

http://openlibrary.org/people/amillar

Example edit:
http://openlibrary.org/books/OL11233153M/In_Spring_Time?b=3&a=2&_compare=Compare&m=diff

Looks like edits to some stuff were a bit tricksy? e.g.
http://openlibrary.org/recentchanges/2010/11/30/edit-book/42076112

> I'd like to automate some of that myself, but I haven't come across
> any references to bulk update tools for users. I've downloaded the
> dumps and grep'ed through them as information for author merges, but I
> haven't seen any way for me to do the actual updates besides a real
> browser. The API docs indicate they are read-only for remote users.

We've certainly talked about how fantastic it would be to allow people
out there to write bots to work on Open Library records. Presumably,
each bot would need to be reviewed by OL staff (or trusted contributors)
before it is let loose on the OL dataset... We could build a page under
/developers that lists all the bots people write, and provides steps for
people to submit a bot for review.

We've just been through that process on Wikipedia, fwiw.
Was interesting -
http://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/OpenlibraryBot

Would something like that be worth pursuing?

We're also planning to expand the capacity to write data to OL via the
API. You can see a list of the APIs we're wanting to document here:
http://openlibrary.org/developers/api

Alan - can you tell us what you're up to?

> Anyone have any techniques they are using currently for mass updates?

There are a few bots written by OL employees, like ImportBot, WorkBot,
OpenLibraryBot, StatsBot etc. Ben Gimpert wrote the bot that stamped
matching records with Goodreads IDs, and an intern, Daniel, wrote
something similar to do the same thing with LibraryThing IDs.

http://openlibrary.org/people/bgimpertBot
https://github.com/bgimpert/openlibrary

http://openlibrary.org/people/IdentifierBot
https://github.com/dmontalvo/IdentifierBot

As far as I know, there are no external mass updates happening, but as I
say, this would be fabulous to try to develop. And please, list peeps,
correct me if I'm wrong!

Cheers,
george
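
For illustration only, the pattern matching discussed above (the varying
brackets around "from old catalog", and title suffixes such as
"(Great Classics Series)") might be sketched roughly as below. This is a
minimal Python sketch under stated assumptions: the regexes, the function
names and the sample strings are hypothetical, it is not code that any
Open Library bot is known to run, and it only transforms strings locally
without writing anything to OL.

```python
import re

# Assumption: the marker appears bare or wrapped in [] or (), in any case.
CATALOG_MARKER = re.compile(
    r"\s*[\[(]?\s*from old catalog\s*[\])]?\s*", re.IGNORECASE
)

# Assumption: a movable suffix is a trailing parenthesised phrase that
# ends in "Series" or "Edition", e.g. "(Large Print Edition)".
TITLE_SUFFIX = re.compile(
    r"^(?P<title>.+?)\s*\((?P<suffix>[^()]*\b(?:Series|Edition))\)\s*$",
    re.IGNORECASE,
)


def strip_catalog_marker(name):
    """Remove a 'from old catalog' marker from an author name."""
    cleaned = CATALOG_MARKER.sub(" ", name)
    # Collapse whitespace and trim any comma left dangling by the removal.
    return " ".join(cleaned.split()).strip(" ,")


def split_title_suffix(title):
    """Return (title, suffix); suffix is None when nothing matches."""
    match = TITLE_SUFFIX.match(title)
    if match is None:
        return title, None
    return match.group("title"), match.group("suffix")
```

Returning the suffix instead of discarding it keeps the metadata
available for moving into another field, and records that match neither
pattern come back unchanged so they can be left for a human to review.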