*From 2009, but more relevant than ever.*

https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/

The Chronicle Review

Google’s Book Search: A Disaster for Scholars

*By **Geoffrey Nunberg***

August 31, 2009

Whether the Google books settlement passes muster with the U.S. District Court and the Justice Department, Google’s book search is clearly on track to becoming the world’s largest digital library. No less important, it is also almost certain to be the last one. Google’s five-year head start and its relationships with libraries and publishers give it an effective monopoly: No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else: Elsevier, Unesco, Wal-Mart. But it’s safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google’s servers today, augmented by the millions of titles published in the interim.

That realization lends a particular urgency to the concerns that people have voiced about the settlement, about pricing, access, and privacy, among other things. But for scholars, it raises another, equally basic question: What assurances do we have that Google will do this right?

Doing it right depends on what exactly “it” is. Google has been something of a shape-shifter in describing the project. The company likes to refer to Google’s book search as a “library,” but it generally talks about books as just another kind of information resource to be incorporated into Greater Google. As Sergey Brin, co-founder of Google, puts it: “We just feel this is part of our core mission. There is fantastic information in books.
Often when I do a search, what is in a book is miles ahead of what I find on a Web site.”

Seen in that light, the quality of Google’s book search will be measured by how well it supports the familiar activity that we have come to think of as “googling,” in tribute to the company’s specialty: entering in a string of keywords in an effort to locate specific information, like the dates of the Franco-Prussian War. For those purposes, we don’t really care about metadata: the whos, whats, wheres, and whens provided by a library catalog. It’s enough just to find a chunk of a book that answers our needs and barrel into it sideways.

But we’re sometimes interested in finding a book for reasons that have nothing to do with the information it contains, and for those purposes googling is not a very efficient way to search. If you’re looking for a particular edition of *Leaves of Grass* and simply punch in, “I contain multitudes,” that’s what you’ll get. For those purposes, you want to be able to come in via the book’s metadata, the same way you do if you’re trying to assemble all the French editions of Rousseau’s *Social Contract* published before 1800 or books of Victorian sermons that talk about profanity.

Or you may be interested in books simply as records of the language as it was used in various periods or genres. Not surprisingly, that’s what gets linguists and assorted wordinistas adrenalized at the thought of all the big historical corpora that are coming online. But it also raises alluring possibilities for social, political, and intellectual historians and for all the strains of literary philology, old and new.
With the vast collection of published books at hand, you can track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase “gentle reader.”

But to pose those questions, you need reliable metadata about dates and categories, which is why it’s so disappointing that the book search’s metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.

Start with publication dates. To take Google’s word for it, 1899 was a literary *annus mirabilis,* which saw the publication of Raymond Chandler’s *Killer in the Rain*, *The Portable Dorothy Parker*, André Malraux’s *La Condition Humaine*, Stephen King’s *Christine*, *The Complete Shorter Fiction of Virginia Woolf*, Raymond Williams’s *Culture and Society 1780-1950,* and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s *Bonfire of the Vanities* is dated 1888, and an edition of Henry James’s *What Maisie Knew* is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google’s book search, but these errors are endemic. A search on “Internet” in books published before 1950 produces 527 results; “Medicare” for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. “Charles Dickens” turns up 182 results for publications before 1812, the vast majority of them referring to the writer.
The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

How frequent are such errors? A search on books published before 1920 mentioning “candy bar” turns up 66 hits, of which 46 (70 percent) are misdated. I don’t think that’s representative of the overall proportion of metadata errors, though they are much more common in older works than for the recent titles Google received directly from publishers. But even if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.

Google acknowledges the incorrect dates but says they came from the providers. It’s true that Google has received some groups of books that are systematically misdated, like a collection of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google’s own doing. A lot of them arise from uneven efforts to automatically extract a publication date from a scanned text. A 1901 history of bookplates from the Harvard University Library is correctly dated in the library’s catalog. Google’s incorrect date of 1574 for the volume is drawn from an Elizabethan armorial bookplate displayed on the frontispiece. An 1890 guidebook called *London of To-Day* is correctly dated in the Harvard catalog, but Google assigns it a date of 1774, which is taken from a front-matter advertisement for a shirt-and-hosiery manufacturer that boasts it was established in that year.

Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken’s *The American Language* is classified as Family & Relationships.
A French edition of *Hamlet* and a Japanese edition of *Madame Bovary* are both classified as Antiques and Collectibles (a 1930 English edition of Flaubert’s novel is classified under Physicians, which I suppose makes a bit more sense). An edition of *Moby Dick* is labeled Computers; *The Cat Lover’s Book of Fascinating Facts* falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google’s little joke).

You can see how pervasive those misclassifications are when you look at all the labels assigned to a single famous work. Of the first 10 results for *Tristram Shandy,* four are classified as Fiction, four as Family & Relationships, one as Biography & Autobiography, and one is not classified. Other editions of the novel are classified as Literary Collections, History, and Music. The first 10 hits for *Leaves of Grass* are variously classified as Poetry, Juvenile Nonfiction, Fiction, Literary Criticism, Biography & Autobiography, and, mystifyingly, Counterfeits and Counterfeiting. And various editions of *Jane Eyre* are classified as History, Governesses, Love Stories, Architecture, and Antiques & Collectibles (as in, “Reader, I marketed him.”).

Here, too, Google has blamed the errors on the libraries and publishers who provided the books. But the libraries can’t be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries.
And BISAC classifications weren’t in wide use before the last decade or two, so only Google can be responsible for their misapplications on numerous books published earlier than that: the 1919 edition of *Robinson Crusoe* assigned to Crafts & Hobbies or the 1907 edition of Sir Thomas Browne’s *Hydriotaphia: Urne-Buriall,* which has been assigned to Gardening.

Google’s fine algorithmic hand is also evident in a lot of classifications of recent works. The 2003 edition of Susan Bordo’s *Unbearable Weight: Feminism, Western Culture, and the Body* (misdated 1899) is assigned to Health & Fitness, not a labeling you could imagine coming from its publisher, the University of California Press, but one a classifier might come up with on the basis of the title, like the Religion tag that Google assigns to a 2001 biography of Mae West that’s subtitled *An Icon in Black and White* or the Health & Fitness label on a 1962 number of the medievalist journal *Speculum*.

But even when it gets the BISAC categories roughly right, the more important question is why Google would want to use those headings in the first place. People from Google have told me they weren’t included at the publishers’ request, and it may be that someone thought they’d be helpful for ad placement. (The ad placement on Google’s book search right now is often comical, as when a search for *Leaves of Grass* brings up ads for plant and sod retailers, though that’s strictly Google’s problem, and one, you’d imagine, that they’re already on top of.) But it’s a disastrous choice for the book search. The BISAC scheme is well-suited for a chain bookstore or a small public library, where consumers or patrons browse for books on the shelves. But it’s of little use when you’re flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods.
For example, the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast, the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European. In short, Google has taken a group of the world’s great research collections and returned them in the form of a suburban-mall bookstore.

Such examples don’t exhaust Google’s metadata errors by any means. In addition to the occasionally quizzical renamings of works (*Moby Dick: or the White Wall*), there are a number of mismatches of titles and texts. Click on the link for the 1818 *Théorie de l’Univers*, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford’s 1983 novel *Voice of the Heart,* while the link on a misdated number of Dickens’s *Household Words* takes you to a 1742 *Histoire de l’Académie Royale des Sciences*. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the “about this book” page for an edition of one French novel shows the striking attribution, “Madame Bovary By Henry James.” More mysterious is the entry for a book called *The Mosaic Navigator: The Essential Guide to the Internet Interface,* which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. The only connection I can come up with is that Jones was the translator of Freud’s *Moses and Monotheism,* which must have somehow triggered the other sense of the word “mosaic,” though the details of the process leave me baffled.

For the present, then, scholars will have to put on hold their visions of tracking the 19th-century fortunes of liberalism or quantifying the shift of “United States” from a plural to singular noun phrase over the first century of the republic: The metadata simply aren’t up to it. It’s true that Google is aware of a lot of these problems and they’ve pledged to fix them. (Indeed, since I presented some of these errors at a conference last week, Google has already rushed to correct many of them.) But it isn’t clear whether they plan to go about this in the same way they’re addressing the scanning errors that riddle the texts, correcting them as (and if) they’re reported. That isn’t adequate here: There are simply too many errors. And while Google’s machine classification system will certainly improve, extracting metadata mechanically isn’t sufficient for scholarly purposes. After first seeming indifferent, Google decided it did want to acquire the library records for scanned books along with the scans themselves, but as of now the company hasn’t licensed them for display or use; hence, presumably, those stabs at automatically recovering publication dates from the scanned texts.

Some of the slack may be picked up by other organizations such as the Internet Archive or Hathi Trust, a consortium of participating libraries that is planning to make available several million of the public-domain books from their collections that Google scanned, along with their bibliographic records. But for now those sources can only provide access to books in the public domain, about 15 percent of the scanned collections; only Google will have the right to display the orphan works published since 1923.

In any case, none of that should relieve Google of the responsibility of making its collections an adequate resource for scholarly research.
That means, at a minimum, licensing the catalogs of the Library of Congress and OCLC Online Computer Library Center and incorporating them into the search engine so that users can get accurate results when they search on various combinations of dates, keywords, subject headings, and the like. (“Adequate” means a lot more than that, as well, from improving the quality of scanning to improving Google’s very flaky hit-count algorithms and rationalizing the resulting rankings, which now make no sense at all and often lead with inferior or shoddy editions of classic works.) Whether or not a guarantee of quality is a contractual obligation, it’s implicit in the project itself. Google has, justifiably, described its book-scanning program as a public good. But as Pamela Samuelson, a director of the Center for Law & Technology at the University of California at Berkeley, has said, every great public good implies a great public trust.

I’m actually more optimistic than some of my colleagues who have criticized the settlement. Not that I’m counting on selfless public-spiritedness to motivate Google to invest the time and resources in getting this right. But I have the sense that a lot of the initial problems are due to Google’s slightly clueless fumbling as it tried to master a domain that turned out to be a lot more complex than the company first realized. It’s clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google’s great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren’t simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.

That makes for a steep learning curve, all the more so because of Google’s haste to complete the project so that potential competitors would be confronted with a fait accompli. But whether or not the needs of scholars are a priority, the company doesn’t want Google’s book search to become a running scholarly joke. And it may be responsive to pressure from its university library partners (who weren’t particularly attentive to questions of quality when they signed on with Google), particularly if they are urged (or if necessary, prodded) to make noise about shoddy metadata by the scholars whose interests they represent. If recent history teaches us anything, it’s that Google is a very quick study.

*Geoffrey Nunberg, a linguist, is an adjunct full professor at the School of Information at the University of California at Berkeley. Images of some of the errors discussed in this article can be found here: <http://tinyurl.com/lhjvns>*
