I've finally figured out how to get raw OCR text out of the HathiTrust API, but it is really slow. Any hints out there?
To use the HathiTrust Data API, a person first needs to get a couple of access tokens. Applications then use the tokens to authenticate. Once this is done, a simple URL can be sent and cool stuff will be returned. For example, the following URL will return the first page of OCR:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2

By continually incrementing the page number in the URL, other pages can be gotten:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/4?v=2

By incrementing the page number until an error is returned, one can get the whole of the document. I don't think there is a way to get the whole of the document in one go.

Similarly, a person can get page images:

  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/1?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/4?v=2

Again, by incrementing the page number until an error is returned, all the images can be downloaded, and a PDF file could be created.

By combining the traditional reading of a book (PDF) with the text mining of the OCR, very interesting things can take place. Thorough understanding could be obtained.

Unfortunately, continually requesting individual pages seems laborious, not to mention s l o w. It takes tens of minutes to do the good work.

Attached is the code I use to do the work. Can you suggest ways things could be sped up? Am I missing something when it comes to the API? Maybe if I do the work in a HathiTrust Research Center "capsule" things would be faster?

--
Eric Morgan
#!/usr/bin/env perl

# htid2txt.pl - given a HathiTrust identifier, output a plain (OCRed) text

# Eric Lease Morgan <[email protected]>
# (c) University of Notre Dame; distributed under a GNU Public License

# February 10, 2019 - first cut


# configure
use constant KEY        => '';
use constant SECRET     => '';
use constant OCRREQUEST => 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/';
use constant HTID       => 'uva.x000274833';

# require
use strict;
use OAuth::Lite::Consumer;
use OAuth::Lite::AuthMethod;

# initialize
my $consumer = OAuth::Lite::Consumer->new(
	consumer_key    => KEY,
	consumer_secret => SECRET,
	auth_method     => OAuth::Lite::AuthMethod::URL_QUERY,
);

# continually get pages
my $continue   = 1;
my $page       = 0;
my $ocrrequest = OCRREQUEST . HTID;
while ( $continue ) {

	# increment and build request url
	$page++;
	my $url = $ocrrequest . '/' . $page;

	# request
	my $response = $consumer->request( method => 'GET', url => $url, params => { v => '2' } );

	# output, conditionally; stop at the first error (a 404 marks the end
	# of the volume) so that a failed request can not loop forever
	if ( ! $response->is_success ) { last }
	else { print $response->content, "\n" }

}

# done
exit;
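
The same increment-until-error loop applies to the page images as well as the OCR. What follows is only a minimal sketch of that idea, not a tested client: page_url and get_pages are hypothetical helper names, and the fetcher is passed in as a code reference so the loop can be tried without credentials. The stand-in fetcher below simply pretends the volume has three pages; a real one would wrap the signed request from the script above (with the v => '2' parameter) and return undef on an error response.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# base of the Data API URLs quoted in the post
use constant BASE => 'https://babel.hathitrust.org/cgi/htd/volume';

# build the URL for one page; $kind is 'pageocr' or 'pageimage'
# (hypothetical helper; the v=2 parameter is left to the signed request)
sub page_url {
	my ( $kind, $htid, $page ) = @_;
	return join( '/', BASE, $kind, $htid, $page );
}

# call $fetch->( $url ) for pages 1, 2, 3... until it returns undef,
# which stands in for the error that marks the end of the volume
sub get_pages {
	my ( $kind, $htid, $fetch ) = @_;
	my @pages;
	my $page = 0;
	while ( 1 ) {
		$page++;
		my $content = $fetch->( page_url( $kind, $htid, $page ) );
		last unless defined $content;
		push @pages, $content;
	}
	return @pages;
}

# stand-in fetcher (an assumption, for illustration only): pretends
# the volume has exactly three pages and makes no network request
my $fake = sub {
	my ( $url ) = @_;
	my ( $page ) = $url =~ m{/(\d+)$};
	return $page <= 3 ? "content of $url" : undef;
};

# collect the "images" and report how many pages were found
my @images = get_pages( 'pageimage', 'uva.x000274833', $fake );
print scalar( @images ), " pages\n";
```

Because the kind of page is just a parameter, one loop serves both the text-mining and the PDF-building halves of the workflow; only the fetcher needs to know about OAuth.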
