[CODE4LIB] hathitrust api

Eric Lease Morgan Sun, 10 Feb 2019 17:51:38 -0800

I've finally figured out how to get raw OCR text out of the HathiTrust API, but 
it is really slow. Any hints out there?


To use the HathiTrust Data API a person needs to first get a couple of access 
tokens. Applications then need to use the tokens to authenticate. Once this is 
done, a simple URL can be sent and cool stuff will be returned. For example, 
the following URL will return the first page of OCR:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/1?v=2

By incrementing the page number at the end of the URL, other pages can be retrieved:

  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageocr/uva.x000274833/4?v=2

By incrementing the URL until an error is returned, one can get the whole of 
the document. I don't think there is a way to get the whole of the document in 
one go.
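
The increment-until-error idea can be sketched apart from the network code. In the minimal sketch below, fetch_all_pages and the $fetch callback are hypothetical names of my own, not part of the HathiTrust API; the callback is assumed to return an HTTP status code and the page content, and the page cap is an arbitrary safety assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch->($page) for pages 1, 2, 3, ... and
# collect the content until the first non-200 status is returned.
# $fetch is assumed to return a two-item list: ( $status, $content ).
sub fetch_all_pages {
    my ( $fetch, $max_pages ) = @_;
    $max_pages //= 10_000;    # safety cap (an assumption) so the loop always ends

    my @pages;
    for my $page ( 1 .. $max_pages ) {
        my ( $status, $content ) = $fetch->( $page );
        last if $status != 200;    # any error ends the volume
        push @pages, $content;
    }
    return @pages;
}

# Usage with a stand-in fetcher that pretends the volume has three pages
my $fake = sub { my $n = shift; return $n <= 3 ? ( 200, "page $n" ) : ( 404, '' ) };
my @text = fetch_all_pages( $fake );
print scalar( @text ), " pages retrieved\n";
```

In real use the callback would wrap the signed request; keeping it separate makes the stopping rule easy to check without touching the API.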

Similarly, a person can get page images:

  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/1?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/2?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/3?v=2
  https://babel.hathitrust.org/cgi/htd/volume/pageimage/uva.x000274833/4?v=2

Again, by incrementing the URL until an error is returned, all the images can 
be downloaded, and a PDF file could be created.
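
Since the OCR and image URLs differ only in one path segment, both can be built by the same routine. This is only a sketch: page_url is a hypothetical helper name, the base URL and the v=2 parameter are copied from the examples above, and the resulting URL still has to be sent as a signed request using the access tokens.

```perl
use strict;
use warnings;

# Base of the Data API, taken from the example URLs above
use constant BASE => 'https://babel.hathitrust.org/cgi/htd/volume';

# Hypothetical helper: build the URL for one page of a volume.
# $resource is 'pageocr' or 'pageimage'; $seq is the page sequence number.
sub page_url {
    my ( $resource, $htid, $seq ) = @_;
    die "unknown resource: $resource\n"
        unless $resource eq 'pageocr' || $resource eq 'pageimage';
    return join( '/', BASE, $resource, $htid, $seq ) . '?v=2';
}

# Usage: print the image URL for the first page of the example volume
print page_url( 'pageimage', 'uva.x000274833', 1 ), "\n";
```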

By combining the traditional reading of a book (PDF) with text mining of the 
OCR, very interesting things can take place, and a more thorough understanding 
of the work could be obtained.

Unfortunately, continually requesting individual pages seems laborious, not to 
mention  s l o w .  It takes tens of minutes to do the good work.

Attached is the code I use to do the work. Can you suggest ways things could be 
sped up? Am I missing something when it comes to the API? Maybe if I do the 
work in a HathiTrust Research Center "capsule" things would be faster? 

--
Eric Morgan

#!/usr/bin/env perl

# htid2txt.pl - given a HathiTrust identifier, output plain (OCRed) text

# Eric Lease Morgan <[email protected]>
# (c) University of Notre Dame; distributed under a GNU Public License

# February 10, 2019 - first cut


# configure
use constant KEY        => '';
use constant SECRET     => ''; 
use constant OCRREQUEST => 'https://babel.hathitrust.org/cgi/htd/volume/pageocr/';
use constant HTID       => 'uva.x000274833';

# require
use strict;
use warnings;
use OAuth::Lite::Consumer;
use OAuth::Lite::AuthMethod;

# initialize
my $consumer = OAuth::Lite::Consumer->new(
		consumer_key    => KEY,
		consumer_secret => SECRET,
		auth_method     => OAuth::Lite::AuthMethod::URL_QUERY,
	);

# continually get pages
my $continue   = 1;
my $page       = 0;
my $ocrrequest = OCRREQUEST . HTID;
while ( $continue ) {

	# increment and build request url
	$page++;
	my $url = $ocrrequest . '/' . $page;

	# request
	my $response = $consumer->request(
			method  => 'GET',
			url     => $url,
			params  => { v => '2' }
		);

	# output, conditionally; stop on any error (not only a 404) so that a
	# failed request cannot loop forever
	if ( ! $response->is_success ) { last }
	else { print $response->content, "\n" }
	
}

# done
exit;
