[MCM] Google's book search is a disaster for scholarship

Mark Crispin Miller Sun, 25 Oct 2020 16:56:01 -0700

*From 2009, but more relevant than ever.*

https://www.chronicle.com/article/googles-book-search-a-disaster-for-scholars/
> The Chronicle Review
> Google’s Book Search: A Disaster for Scholars
> *By Geoffrey Nunberg*
> August 31, 2009
>
> Whether the Google books settlement passes muster with the U.S. District
> Court and the Justice Department, Google’s book search is clearly on track
> to become the world’s largest digital library. No less important, it is
> also almost certain to be the last one. Google’s five-year head start and
> its relationships with libraries and publishers give it an effective
> monopoly: No competitor will be able to come after it on the same scale.
> Nor is technology going to lower the cost of entry. Scanning will always be
> an expensive, labor-intensive project. Of course, 50 or 100 years from now
> control of the collection may pass from Google to somebody else—Elsevier,
> Unesco, Wal-Mart. But it’s safe to assume that the digitized books that
> scholars will be working with then will be the very same ones that are
> sitting on Google’s servers today, augmented by the millions of titles
> published in the interim.
>
> That realization lends a particular urgency to the concerns that people
> have voiced about the settlement: about pricing, access, and privacy, among
> other things. But for scholars, it raises another, equally basic question:
> What assurances do we have that Google will do this right?
>
> Doing it right depends on what exactly “it” is. Google has been something
> of a shape-shifter in describing the project. The company likes to refer to
> Google’s book search as a “library,” but it generally talks about books as
> just another kind of information resource to be incorporated into Greater
> Google. As Sergey Brin, co-founder of Google, puts it: “We just feel this
> is part of our core mission. There is fantastic information in books. Often
> when I do a search, what is in a book is miles ahead of what I find on a
> Web site.”
>
> Seen in that light, the quality of Google’s book search will be measured
> by how well it supports the familiar activity that we have come to think of
> as “googling,” in tribute to the company’s specialty: entering in a string
> of keywords in an effort to locate specific information, like the dates of
> the Franco-Prussian War. For those purposes, we don’t really care about
> metadata—the whos, whats, wheres, and whens provided by a library catalog.
> It’s enough just to find a chunk of a book that answers our needs and
> barrel into it sideways.
>
> But we’re sometimes interested in finding a book for reasons that have
> nothing to do with the information it contains, and for those purposes
> googling is not a very efficient way to search. If you’re looking for a
> particular edition of *Leaves of Grass* and simply punch in, “I contain
> multitudes,” that’s what you’ll get. For those purposes, you want to be
> able to come in via the book’s metadata, the same way you do if you’re
> trying to assemble all the French editions of Rousseau’s *Social Contract*
> published before 1800 or books of Victorian sermons that talk about
> profanity.
>
> Or you may be interested in books simply as records of the language as it
> was used in various periods or genres. Not surprisingly, that’s what gets
> linguists and assorted wordinistas adrenalized at the thought of all the
> big historical corpora that are coming online. But it also raises alluring
> possibilities for social, political, and intellectual historians and for
> all the strains of literary philology, old and new. With the vast
> collection of published books at hand, you can track the way happiness
> replaced felicity in the 17th century, quantify the rise and fall of
> propaganda or industrial democracy over the course of the 20th century, or
> pluck out all the Victorian novels that contain the phrase “gentle reader.”
>
> But to pose those questions, you need reliable metadata about dates and
> categories, which is why it’s so disappointing that the book search’s
> metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a
> mess.
>
> Start with publication dates. To take Google’s word for it, 1899 was a
> literary *annus mirabilis,* which saw the publication of Raymond
> Chandler’s *Killer in the Rain*, *The Portable Dorothy Parker*, André
> Malraux’s *La Condition Humaine*, Stephen King’s *Christine*, *The
> Complete Shorter Fiction of Virginia Woolf*, Raymond Williams’s *Culture
> and Society 1780-1950,* and Robert Shelton’s biography of Bob Dylan, to
> name just a few. And while there may be particular reasons why 1899 comes
> up so often, such misdatings are spread out across the centuries. A book on
> Peter F. Drucker is dated 1905, four years before the management consultant
> was even born; a book of Virginia Woolf’s letters is dated 1900, when she
> would have been 8 years old. Tom Wolfe’s *Bonfire of the Vanities* is
> dated 1888, and an edition of Henry James’s *What Maisie Knew* is dated
> 1848.
>
> Of course, there are bound to be occasional howlers in a corpus as
> extensive as Google’s book search, but these errors are endemic. A search
> on “Internet” in books published before 1950 produces 527 results;
> “Medicare” for the same period gets almost 1,600. Or you can simply enter
> the names of famous writers or public figures and restrict your search to
> works published before the year of their birth. “Charles Dickens” turns up
> 182 results for publications before 1812, the vast majority of them
> referring to the writer. The same type of search turns up 81 hits for
> Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for
> Barack Obama. (Or maybe that was another Barack Obama.)
>
> How frequent are such errors? A search on books published before 1920
> mentioning “candy bar” turns up 66 hits, of which 46—70 percent—are
> misdated. I don’t think that’s representative of the overall proportion of
> metadata errors, though they are much more common in older works than for
> the recent titles Google received directly from publishers. But even if the
> proportion of misdatings is only 5 percent, the corpus is riddled with
> hundreds of thousands of erroneous publication dates.
>
> Google acknowledges the incorrect dates but says they came from the
> providers. It’s true that Google has received some groups of books that are
> systematically misdated, like a collection of Portuguese-language works all
> dated 1899. But a very large proportion of the errors are clearly Google’s
> own doing. A lot of them arise from uneven efforts to automatically extract
> a publication date from a scanned text. A 1901 history of bookplates from
> the Harvard University Library is correctly dated in the library’s catalog.
> Google’s incorrect date of 1574 for the volume is drawn from an Elizabethan
> armorial bookplate displayed on the frontispiece. An 1890 guidebook called
> *London of To-Day* is correctly dated in the Harvard catalog, but Google assigns
> it a date of 1774, which is taken from a front-matter advertisement for a
> shirt-and-hosiery manufacturer that boasts it was established in that year.
>
> Then there are the classification errors, which taken together can make
> for a kind of absurdist poetry. H.L. Mencken’s *The American Language* is
> classified as Family & Relationships. A French edition of *Hamlet* and a
> Japanese edition of *Madame Bovary* are both classified as Antiques and
> Collectibles (a 1930 English edition of Flaubert’s novel is classified
> under Physicians, which I suppose makes a bit more sense). An edition of
> *Moby Dick* is labeled Computers; *The Cat Lover’s Book of Fascinating Facts*
> falls under Technology & Engineering. And a catalog of copyright entries
> from the Library of Congress is listed under Drama (for a moment I wondered
> if maybe that one was just Google’s little joke).
>
> You can see how pervasive those misclassifications are when you look at
> all the labels assigned to a single famous work. Of the first 10 results
> for *Tristram Shandy,* four are classified as Fiction, four as Family &
> Relationships, one as Biography & Autobiography, and one is not classified.
> Other editions of the novel are classified as Literary Collections,
> History, and Music. The first 10 hits for *Leaves of Grass* are variously
> classified as Poetry, Juvenile Nonfiction, Fiction, Literary Criticism,
> Biography & Autobiography, and, mystifyingly, Counterfeits and
> Counterfeiting. And various editions of *Jane Eyre* are classified as
> History, Governesses, Love Stories, Architecture, and Antiques &
> Collectibles (as in, “Reader, I marketed him.”).
>
> Here, too, Google has blamed the errors on the libraries and publishers
> who provided the books. But the libraries can’t be responsible for books
> mislabeled as Health and Fitness and Antiques and Collectibles, for the
> simple reason that those categories are drawn from the Book Industry
> Standards and Communications codes, which are used by the publishers to
> tell booksellers where to put books on the shelves, not from any of the
> classification systems used by libraries. And BISAC classifications weren’t
> in wide use before the last decade or two, so only Google can be
> responsible for their misapplications on numerous books published earlier
> than that: the 1919 edition of *Robinson Crusoe* assigned to Crafts &
> Hobbies or the 1907 edition of Sir Thomas Browne’s *Hydriotaphia:
> Urne-Buriall,* which has been assigned to Gardening.
>
> Google’s fine algorithmic hand is also evident in a lot of classifications
> of recent works. The 2003 edition of Susan Bordo’s *Unbearable Weight:
> Feminism, Western Culture, and the Body* (misdated 1899) is assigned to
> Health & Fitness—not a labeling you could imagine coming from its
> publisher, the University of California Press, but one a classifier might
> come up with on the basis of the title, like the Religion tag that Google
> assigns to a 2001 biography of Mae West that’s subtitled *An Icon in
> Black and White* or the Health & Fitness label on a 1962 number of the
> medievalist journal *Speculum*.
>
> But even when it gets the BISAC categories roughly right, the more
> important question is why Google would want to use those headings in the
> first place. People from Google have told me they weren’t included at the
> publishers’ request, and it may be that someone thought they’d be helpful
> for ad placement. (The ad placement on Google’s book search right now is
> often comical, as when a search for *Leaves of Grass* brings up ads for
> plant and sod retailers—though that’s strictly Google’s problem, and one,
> you’d imagine, that they’re already on top of.) But it’s a disastrous
> choice for the book search. The BISAC scheme is well-suited for a chain
> bookstore or a small public library, where consumers or patrons browse for
> books on the shelves. But it’s of little use when you’re flying blind in a
> library with several million titles, including scholarly works, foreign
> works, and vast quantities of books from earlier periods. For example, the
> BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like
> New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast, the
> Poetry subject heading has just 20 subheadings. That means that Bambi and
> Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and
> Verlaine have to scrunch together in the single subheading reserved for
> Poetry/Continental European. In short, Google has taken a group of the
> world’s great research collections and returned them in the form of a
> suburban-mall bookstore.
>
> Such examples don’t exhaust Google’s metadata errors by any means. In
> addition to the occasionally quizzical renamings of works (*Moby Dick: or
> the White Wall*), there are a number of mismatches of titles and texts.
> Click on the link for the 1818 *Théorie de l’Univers*, a work on
> cosmology by the Napoleonic mathematician and general Jacques Alexander
> François Allix, and it takes you to Barbara Taylor Bradford’s 1983 novel
> *Voice of the Heart,* while the link on a misdated number of Dickens’s *Household
> Words* takes you to a 1742 *Histoire de l’Académie Royale des Sciences*.
> Numerous entries mix up the names of authors, editors, and writers of
> introductions, so that the “about this book” page for an edition of one
> French novel shows the striking attribution, “Madame Bovary By Henry
> James.” More mysterious is the entry for a book called *The Mosaic
> Navigator: The Essential Guide to the Internet Interface,* which is dated
> 1939 and attributed to Sigmund Freud and Katherine Jones. The only
> connection I can come up with is that Jones was the translator of Freud’s
> *Moses and Monotheism,* which must have somehow triggered the other sense of the
> word “mosaic,” though the details of the process leave me baffled.
>
> For the present, then, scholars will have to put on hold their visions of
> tracking the 19th-century fortunes of liberalism or quantifying the shift
> of “United States” from a plural to singular noun phrase over the first
> century of the republic: The metadata simply aren’t up to it. It’s true
> that Google is aware of a lot of these problems and they’ve pledged to fix
> them. (Indeed, since I presented some of these errors at a conference last
> week, Google has already rushed to correct many of them.) But it isn’t
> clear whether they plan to go about this in the same way they’re addressing
> the scanning errors that riddle the texts, correcting them as (and if)
> they’re reported. That isn’t adequate here: There are simply too many
> errors. And while Google’s machine classification system will certainly
> improve, extracting metadata mechanically isn’t sufficient for scholarly
> purposes. After first seeming indifferent, Google decided it did want to
> acquire the library records for scanned books along with the scans
> themselves, but as of now the company hasn’t licensed them for display or
> use—hence, presumably, those stabs at automatically recovering publication
> dates from the scanned texts.
>
> Some of the slack may be picked up by other organizations such as the
> Internet Archive or Hathi Trust, a consortium of participating libraries
> that is planning to make available several million of the public-domain
> books from their collections that Google scanned, along with their
> bibliographic records. But for now those sources can only provide access to
> books in the public domain, about 15 percent of the scanned collections;
> only Google will have the right to display the orphan works published since
> 1923.
>
> In any case, none of that should relieve Google of the responsibility of
> making its collections an adequate resource for scholarly research. That
> means, at a minimum, licensing the catalogs of the Library of Congress and
> OCLC Online Computer Library Center and incorporating them into the search
> engine so that users can get accurate results when they search on various
> combinations of dates, keywords, subject headings, and the like.
> (“Adequate” means a lot more than that, as well, from improving the quality
> of scanning to improving Google’s very flaky hit-count algorithms and
> rationalizing the resulting rankings, which now make no sense at all and
> often lead with inferior or shoddy editions of classic works.) Whether or
> not a guarantee of quality is a contractual obligation, it’s implicit in
> the project itself. Google has, justifiably, described its book-scanning
> program as a public good. But as Pamela Samuelson, a director of the Center
> for Law & Technology at the University of California at Berkeley, has said,
> every great public good implies a great public trust.
>
> I’m actually more optimistic than some of my colleagues who have
> criticized the settlement. Not that I’m counting on selfless
> public-spiritedness to motivate Google to invest the time and resources in
> getting this right. But I have the sense that a lot of the initial problems
> are due to Google’s slightly clueless fumbling as it tried to master a domain
> that turned out to be a lot more complex than the company first realized.
> It’s clear that Google designed the system without giving much thought to
> the need for reliable metadata. In fact, Google’s great achievement as a
> Web search engine was to demonstrate how easy it could be to locate useful
> information without attending to metadata or resorting to Yahoo-like
> schemes of classification. But books aren’t simply vehicles for
> communicating information, and managing a vast library collection requires
> different skills, approaches, and data than those that enabled Google to
> dominate Web searching.
>
> That makes for a steep learning curve, all the more so because of Google’s
> haste to complete the project so that potential competitors would be
> confronted with a fait accompli. But whether or not the needs of scholars
> are a priority, the company doesn’t want Google’s book search to become a
> running scholarly joke. And it may be responsive to pressure from its
> university library partners—who weren’t particularly attentive to questions
> of quality when they signed on with Google—particularly if they are urged
> (or if necessary, prodded) to make noise about shoddy metadata by the
> scholars whose interests they represent. If recent history teaches us
> anything, it’s that Google is a very quick study.
>
> *Geoffrey Nunberg, a linguist, is an adjunct full professor at the School
> of Information at the University of California at Berkeley. Images of some
> of the errors discussed in this article can be found here.
> <http://tinyurl.com/lhjvns>*