When patrons try to download tens or hundreds of thousands of pages -- and it is not uncommon -- the vendor's software notices the 'excessive' use, sends us an email reminding us that bulk downloading violates our terms of service, and temporarily blacklists the IP address. (That could become more of a problem as we move to NAT/PAT, where everyone appears to the external internet as one of only a few external IPs.)
Granted, these users are usually downloading actual PDFs, not just
citations. I'm not really sure whether they are doing it for personal
research of some kind, or to share with off-shore 'pirate research
paper' facilities (I'm not even making that up), but the volume of use
that triggers the vendors' notices is such that it's definitely an
automated process of some kind, not just someone clicking a lot.
Bulk downloading from our content vendors is usually prohibited by their
terms of service. So, beware.
On 11/14/13 10:30 AM, Eric Lease Morgan wrote:
Thank you for the replies, and after a bit of investigation I learned that I
don’t need to do authentication because the vendor does IP authentication.
Nice! On the other hand, I was still not able to resolve my original problem.
I needed/wanted to download tens of thousands, if not hundreds of thousands, of
citations for text mining analysis. The Web interface to the database/index limits
output to 4,000 items, and selecting the set of these items is beyond tedious; it
is cruel and unusual punishment. I then got the idea of using EndNote's z39.50
client, and after a bit of back & forth I got it working, but the downloading
process was too slow. I then got the bright idea of writing my own z39.50 client
(below). Unfortunately, I learned that the 4,000-record limit is stricter than it
first appears: a person can only download the first 4,000 records in a found set.
Requests for record 4,001, 4,002, etc. fail. This is true in my locally written
client as well as in EndNote.
Alas, it looks as if I am unable to download the data I need/require, unless
somebody at the vendor gives me a data dump. On the other hand, since my locally
written client is so short and simple, I think I can create a Web-based
interface to query many different z39.50 targets and provide on-the-fly text
mining analysis against the results.
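Such an interface might look something like the following sketch. It reuses the same ZOOM calls as the client below, but everything specific to it -- the script name, the target host `z3950.example.org`, the database name `books`, and the `query` parameter -- is a hypothetical placeholder, not a working service:

```perl
#!/usr/bin/perl
# zquery.cgi - hypothetical sketch of a Web front end to a z39.50 target;
# the host, port, and database below are placeholders, not a real service
use strict;
use CGI;
use ZOOM;

my $cgi   = CGI->new;
my $query = $cgi->param( 'query' ) || 'water';

print $cgi->header( -type => 'text/plain' );

eval {

	# connect to the (hypothetical) target; configure; search
	my $conn = ZOOM::Connection->new( 'z3950.example.org', 210, databaseName => 'books' );
	$conn->option( preferredRecordSyntax => 'usmarc' );
	my $rs = $conn->search_pqf( qq(\@attr 1=1016 "$query") );

	# report only the hit count; on-the-fly text mining would go here
	print 'Records found: ', $rs->size, "\n";

};

# report errors
if ( $@ ) { print STDERR "Error ", $@->code, ": ", $@->message, "\n" }
```

The same handful of lines could be pointed at many different targets simply by swapping the host/database values, which is what makes the "query many z39.50 targets" idea plausible.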
In short, I learned a great many things.
—
Eric Lease Morgan
University of Notre Dame
#!/usr/bin/perl
# nytimes-search.pl - rudimentary z39.50 client to query the NY Times
# Eric Lease Morgan <emor...@nd.edu>
# November 13, 2013 - first cut; "Happy Birthday, Steve!"
# usage: ./nytimes-search.pl > nytimes.marc
# configure
use constant DB => 'hnpnewyorktimes';
use constant HOST => 'fedsearch.proquest.com';
use constant PORT => 210;
use constant QUERY => '@attr 1=1016 "trade or tariff"';
use constant SYNTAX => 'usmarc';
# require
use strict;
use ZOOM;
# do the work
eval {

	# connect; configure; search
	my $conn = ZOOM::Connection->new( HOST, PORT, databaseName => DB );
	$conn->option( preferredRecordSyntax => SYNTAX );
	my $rs = $conn->search_pqf( QUERY );

	# requests beyond record 4,000 return errors
	# print $rs->record( 4001 )->raw;

	# retrieve; will stop at record 4,000 because of vendor limitations
	for my $i ( 0 .. $rs->size - 1 ) {

		print STDERR "\tRetrieving record #$i\r";
		print $rs->record( $i )->raw;

	}

};

# report errors
if ( $@ ) { print STDERR "Error ", $@->code, ": ", $@->message, "\n" }
# done
exit;