Not uncommonly, we have patrons who try to download tens or hundreds of thousands of pages. When that happens, the vendor's software notices the 'excessive' use, sends us an email reminding us that bulk downloading violates our terms of service, and temporarily blacklists the IP address (which could become more of a problem as we move to NAT/PAT, where everyone appears to the external internet as one of only a few external IPs).

Granted, these users are usually downloading actual PDFs, not just citations. I'm not really sure whether they are doing it for personal research of some kind or to share with off-shore 'pirate research paper' facilities (I'm not even making that up), but the volume of use that triggers the vendors' notices is such that it's definitely an automated process of some kind, not just someone clicking a lot.

Bulk downloading from our content vendors is usually prohibited by their terms of service. So, beware.

On 11/14/13 10:30 AM, Eric Lease Morgan wrote:
Thank you for the replies, and after a bit of investigation I learned that I 
don’t need to do authentication because the vendor does IP authentication. 
Nice! On the other hand, I was still not able to resolve my original problem.

I needed/wanted to download tens of thousands, if not hundreds of thousands, of 
citations for text mining analysis. The Web interface to the database/index limits 
output to 4,000 items, and selecting that set of items is beyond tedious; it is 
cruel and unusual punishment. I then got the idea of using EndNote's z39.50 
client, and after a bit of back & forth I got it working, but the downloading 
process was too slow. I then got the bright idea of writing my own z39.50 client 
(below). Unfortunately, I learned that the 4,000-record limit is more than a 
display limit: a person can only download the first 4,000 records in a found set. 
Requests for record 4,001, 4,002, etc. fail. This is true in my locally written 
client as well as in EndNote.
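
To make the limit concrete, here is a minimal probe distilled from the 
commented-out line in the full script below, using the same host, database, and 
query. It asks for index 4,000, which is the 4,001st record since ZOOM counts 
from zero; exactly how the refusal surfaces can vary (an undefined record or a 
thrown exception), so this sketch checks for both.

#!/usr/bin/perl

# probe-limit.pl - minimal sketch confirming the 4,000-record download cap;
# host, database, and query are the same as in the full script below

use strict;
use warnings;
use ZOOM;

eval {

        # connect; configure; search
        my $conn = ZOOM::Connection->new( 'fedsearch.proquest.com', 210,
                                          databaseName => 'hnpnewyorktimes' );
        $conn->option( preferredRecordSyntax => 'usmarc' );
        my $rs = $conn->search_pqf( '@attr 1=1016 "trade or tariff"' );
        print STDERR "hits reported: ", $rs->size, "\n";

        # index 4000 is the 4,001st record because ZOOM counts from zero
        my $record = $rs->record( 4000 );
        die "no record returned\n" if ! defined $record;
        print STDERR "retrieved ", length( $record->raw ), " bytes\n";

};

# report the (expected) failure; $@ may be a ZOOM::Exception or a plain string
if ( $@ ) { print STDERR "record 4,001 failed: ", ( ref $@ ? $@->message : $@ ), "\n" }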

Alas, it looks as if I am unable to download the data I need unless somebody at 
the vendor gives me a data dump. On the other hand, since my locally written 
client is so short and simple, I think I can create a Web-based interface to 
query many different z39.50 targets and provide on-the-fly text mining analysis 
against the results.
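
Something like the following might work as a starting point for that Web-based 
interface: a small CGI sketch that runs one PQF query against a list of z39.50 
targets and reports hit counts. The target hosts and database names here are 
only placeholders, and the text mining step is left out.

#!/usr/bin/perl

# multi-search.pl - sketch of a Web-based interface that runs one PQF query
# against several z39.50 targets and reports hit counts; the target list is
# hypothetical and the text mining step is omitted

use strict;
use warnings;
use CGI;
use ZOOM;

# hypothetical targets: display name => [ host, port, database ]
my %targets = (
        'Example catalog one' => [ 'z3950.example.edu', 210, 'catalog' ],
        'Example catalog two' => [ 'z3950.example.org', 210, 'books'   ],
);

# get the query from the Web form, with a sane default
my $cgi   = CGI->new;
my $query = $cgi->param( 'query' ) || '@attr 1=1016 "trade or tariff"';

print $cgi->header( 'text/plain' );

# search each target in turn; one broken target should not kill the rest
for my $name ( sort keys %targets ) {

        my ( $host, $port, $db ) = @{ $targets{ $name } };

        eval {
                my $conn = ZOOM::Connection->new( $host, $port, databaseName => $db );
                $conn->option( preferredRecordSyntax => 'usmarc' );
                my $rs = $conn->search_pqf( $query );
                print "$name: ", $rs->size, " records found\n";
        };
        if ( $@ ) { print "$name: error - ", ( ref $@ ? $@->message : $@ ), "\n" }

}

# done
exit;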

In short, I learned a great many things.

—
Eric Lease Morgan
University of Notre Dame


#!/usr/bin/perl

# nytimes-search.pl - rudimentary z39.50 client to query the NY Times

# Eric Lease Morgan <emor...@nd.edu>
# November 13, 2013 - first cut; "Happy Birthday, Steve!"

# usage: ./nytimes-search.pl > nytimes.marc


# configure
use constant DB     => 'hnpnewyorktimes';
use constant HOST   => 'fedsearch.proquest.com';
use constant PORT   => 210;
use constant QUERY  => '@attr 1=1016 "trade or tariff"';
use constant SYNTAX => 'usmarc';

# require
use strict;
use warnings;
use ZOOM;

# do the work
eval {

        # connect; configure; search
        my $conn = ZOOM::Connection->new( HOST, PORT, databaseName => DB );
        $conn->option( preferredRecordSyntax => SYNTAX );
        my $rs = $conn->search_pqf( QUERY );

        # records are indexed from zero, so index 4000 (the 4,001st record) and beyond return errors
        # print $rs->record( 4000 )->raw;
                        
        # retrieve each record; records are indexed 0 .. size-1, and the
        # loop will break at record 4,000 because of vendor limitations
        for my $i ( 0 .. $rs->size - 1 ) {
        
                print STDERR "\tRetrieving record #$i\r";
                print $rs->record( $i )->raw;
                
        }
                
};

# report errors
if ( $@ ) { print STDERR "Error ", ( ref $@ ? $@->code . ": " . $@->message : $@ ), "\n" }

# done
exit;

