I've finally figured out how to get raw OCR text out of the HathiTrust API, but 
it is really slow. Any hints out there?

To use the HathiTrust Data API, a person first needs to get a couple of access 
tokens (a key and a secret). Applications then use the tokens to authenticate 
each request. Once this is done, a simple URL can be sent and cool stuff will be 
returned. For example, the following URL returns the first page of OCR:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2

By incrementing the page number at the end of the URL, other pages can be retrieved:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/4?v=2
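
Since only the page number changes, a small helper can build these URLs. This 
is a minimal sketch; the name page_url is made up for illustration and is not 
part of the API, and the resulting URLs still have to be sent as authenticated 
requests:

```perl
use strict;
use warnings;

# root of the Data API, copied from the URLs above
use constant ROOT => 'https://babel.hathitrust.org/cgi/htd/volume';

# hypothetical helper: build the URL for one page of a given resource,
# such as "pageocr"; v=2 is the version parameter seen in the URLs above
sub page_url {
	my ( $resource, $htid, $page ) = @_;
	return join( '/', ROOT, $resource, $htid, $page ) . '?v=2';
}

# print the URLs for the first four pages of OCR
print page_url( 'pageocr', 'uva.x000274833', $_ ), "\n" for 1 .. 4;
```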

By incrementing the page number until an error is returned, one can get the 
whole of the document. I don't think there is a way to get the whole of the 
document in one go.
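
The "keep going until an error" idea can be sketched apart from the network 
code. Everything here is hypothetical: get_pages is an illustrative name, the 
fetcher is a stand-in for the real authenticated request, an undefined return 
value stands in for an error response, and the page cap is only a guard against 
looping forever:

```perl
use strict;
use warnings;

# hypothetical helper: call $fetch->( $page ) for pages 1, 2, 3... and
# collect the results until $fetch returns undef (standing in for an
# error response) or $max pages have been read
sub get_pages {
	my ( $fetch, $max ) = @_;
	my @pages;
	for my $page ( 1 .. $max ) {
		my $content = $fetch->( $page );
		last unless defined $content;
		push @pages, $content;
	}
	return @pages;
}

# stand-in for the real request: a three-page "volume"
my @volume = ( 'page one', 'page two', 'page three' );
my $fake   = sub { my $page = shift; return $volume[ $page - 1 ] };

my @pages = get_pages( $fake, 1000 );
print scalar( @pages ), " pages retrieved\n";
```

In real use the fetcher would wrap the authenticated request and return 
undef for any unsuccessful response.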

Similarly, a person can get page images:

  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/1?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/4?v=2

Again, by incrementing the page number until an error is returned, all the 
images can be downloaded, and a PDF file could be created from them.

By combining the traditional reading of a book (PDF) with the text mining of 
the OCR, very interesting things can take place, and a more thorough 
understanding of the work could be obtained.

Unfortunately, continually requesting individual pages seems laborious, not to 
mention s l o w. It takes tens of minutes to do the good work.

Attached is the code I use to do the work. Can you suggest ways things could be 
sped up? Am I missing something when it comes to the API? Maybe if I do the 
work in a HathiTrust Research Center "capsule" things would be faster? 

--
Eric Morgan

#!/usr/bin/env perl

# htid2txt.pl - given a HathiTrust identifier, output a plain (OCRed) text

# Eric Lease Morgan <[email protected]>
# (c) University of Notre Dame; distributed under a GNU Public License

# February 10, 2019 - first cut


# configure
use constant KEY        => '';
use constant SECRET     => ''; 
use constant OCRREQUEST => 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/';
use constant HTID       => 'uva.x000274833';

# require
use strict;
use OAuth::Lite::Consumer;
use OAuth::Lite::AuthMethod;

# initialize
my $consumer = OAuth::Lite::Consumer->new(
		consumer_key    => KEY,
		consumer_secret => SECRET,
		auth_method     => OAuth::Lite::AuthMethod::URL_QUERY,
	);

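# continually get pages
my $page       = 0;
my $ocrrequest = OCRREQUEST . HTID;
while ( 1 ) {

	# increment and build request url
	$page++;
	my $url = $ocrrequest . '/' . $page;

	# request
	my $response = $consumer->request(
			method  => 'GET',
			url     => $url,
			params  => { v => '2' }
		);

	# stop at the first unsuccessful response; a 404 is taken to mean the end
	# of the volume, and any other error (bad tokens, server trouble) is
	# reported, because testing for 404 alone would loop forever on it
	unless ( $response->is_success ) {
		warn "Stopped at page $page: ", $response->status_line, "\n" unless $response->code == 404;
		last;
	}

	# output
	print $response->content, "\n";

}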

# done
exit;


