Thank you Andrew, that is insanely useful.

cheers
stuart

On 30/01/14 12:00, Andrew Anderson wrote:
A few years ago, when OCLC first announced their purchase of EZproxy, we started 
a low-priority research project to see what the alternatives were and what it 
would take to bring them into a production-ready state.  The two open source 
solutions we evaluated were Squid and Apache HTTPd.  We considered other 
options (e.g. Apache Traffic Server), but limited the research to these two 
pieces of software since they are already widely used and familiar to most 
system administrators.

Long story short, Squid did not support URL rewriting in a way that we felt we 
could support well: it required either patches to the core C++ server code, an 
external rewriting process, or an ICAP server implementation.  Some of that has 
improved a bit since the original evaluation, 
but the built-in support for URL rewriting may still need some time to mature.  
Another aspect of Squid that did not seem to be a good fit was that it is 
somewhat limited in its authentication mechanisms vs Apache HTTPd.

So we moved on to evaluating Apache HTTPd with the mod_proxy family of modules. 
 While Apache HTTPd does not support the same advanced cache federation features as 
Squid, it has grown to be a robust proxy solution in its own right, and the 2.4 
release appears to have all of the required pieces out of the box, with the 
mod_proxy_html module functionality.  In addition to basic URL rewriting 
support, you get full HTTP protocol support, mature IPv6 support, GZIP support, 
just about any authentication mechanism you need, a server that you can 
self-host content with easily, as well as a built-in HTTP object cache.

How would it work?

Here’s the current EZproxy stanza for ProQuest:

HTTPHeader X-Requested-With
HTTPHeader Accept-Encoding
Title ProQuest
URL http://search.proquest.com/ip
DJ proquest.com
HJ gateway.proquest.com
DJ umi.com
HJ fedsearch.proquest.com
HJ literature.proquest.com
DJ conquest-leg-insight.com
DJ conquestsystems.com
DJ m.search.proquest.com
DJ media.proquest.com
NeverProxy order.proquest.com
NeverProxy rss.proquest.com

Here’s an Apache HTTPd configuration for ProQuest that accomplishes much the 
same thing for the main search.proquest.com interface:

<VirtualHost _default_:80>
  ServerName search.proquest.com.fqdn

  ProxyRequests Off
  ProxyVia On

  RewriteEngine On
  RewriteRule ^/(.*) http://search.proquest.com/$1 [P]

  <Location "/">
   AllowMethods GET POST OPTIONS
   ProxyPassReverse http://search.proquest.com/
   ProxyPassReverseCookieDomain search.proquest.com search.proquest.com.fqdn
   CacheEnable disk
   SetOutputFilter INFLATE;DEFLATE
   Header Append Vary User-Agent env=!dont-vary
   # Put Authentication directives here
   ErrorDocument 401 /path/to/login
   Require valid-user
  </Location>
</VirtualHost>
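
As a rough sketch of what could replace the "# Put Authentication directives 
here" comment, the simplest option is HTTP Basic authentication against a 
password file.  The realm name and file path below are placeholders, not part 
of the tested configuration, and a real deployment would more likely hook into 
LDAP, CAS, or a form-based login:

        # Assumed: a password file created with htpasswd (hypothetical path)
        AuthType Basic
        AuthName "Library Resources"
        AuthBasicProvider file
        AuthUserFile /etc/httpd/conf/proxy-users.htpasswd

Likewise, "CacheEnable disk" relies on mod_cache_disk being loaded and on a 
cache directory being defined at the server level, for example (the path is 
again a placeholder):

        CacheRoot /var/cache/httpd/proxy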

A few notes on this:

- There is no need for NeverProxy: if you do not define a VirtualHost for the 
hostname, it is not proxied.  So instead of HJ and DJ lines, you add a new 
VirtualHost block for each hostname that needs to be proxied.  The astute will 
ask “what about services that have dozens or hundreds of host entries, like 
Sage?”  Those can be handled by mod_proxy_express (the ProxyExpress directives) 
in Apache HTTPd.

- There is no need for HTTPHeader: since Apache HTTPd is a full HTTP 
proxy/server, it supports all HTTP headers natively.

- Some of the hostnames that are in EZproxy stanzas are not needed, and some 
are legacy hostnames that are no longer used by the vendor.

- Some of the hostnames that are in EZproxy stanzas are for CDN hosted content 
that requires no special access (e.g. JavaScript/CSS/graphics assets that make 
up the vendor’s user interface).  Another example: how many of you have “DJ 
google.com” in one of your stanzas? Now how many of you registered your IP 
addresses with Google in any way?  Outside of Google Scholar, I suspect the 
answers to those questions are “nearly everyone” and “nearly no one”, 
respectively.

- Some of the hostnames are for things that no sane person would do: How many 
people run their discovery services through their EZproxy server vs. 
authenticating their discovery platform by IP address with vendors directly?

- Something that this configuration does that EZproxy does not do is enable 
object caching.  This can easily save 30-50% of your upstream bandwidth usage 
(Proxy/ProxySSL in EZproxy can achieve the same result with an external caching 
proxy server).

- More complex vendor platforms (e.g. Gale Cengage) need ProxyHTML directives 
and ProxyHTMLURLMap configured, and multiple VirtualHost sections to get them 
fully working.  These can be a little fun to get working initially.

- Some services need redirects edited to work correctly, and not break out of 
the proxy:

        Header edit Location http://vendor/ http://vendor.fqdn/

- Some vendors send wrong HTTP headers for the MIME type, and mod_proxy_html 
exposes this in some cases as it rewrites the page.  There may be a better way 
to do this, but this is what I threw together for testing:

        <Location "/badpath">
                ProxyHTMLEnable Off
                SetOutputFilter INFLATE;dummy-html-to-plain
                ExtFilterOptions LogStdErr Onfail=remove
        </Location>
        ExtFilterDefine dummy-html-to-plain mode=output intype=text/html \
                outtype=text/plain cmd="/bin/cat -"
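
For the Sage-style case mentioned above, a minimal mod_proxy_express sketch 
might look like the following.  The wildcard hostname, the map entry, and the 
map file path are all hypothetical.  The map is a DBM file built with 
httxt2dbm from a plain text list of "proxy hostname, backend URL" pairs:

        # express-map.txt (hypothetical entry), one mapping per line:
        #   journal1.sagepub.com.fqdn  http://journal1.sagepub.com
        # Build the DBM file with:
        #   httxt2dbm -i express-map.txt -o express-map
        <VirtualHost _default_:80>
          ServerName sagepub.com.fqdn
          ServerAlias *.sagepub.com.fqdn
          ProxyExpressEnable On
          ProxyExpressDBMFile /etc/httpd/conf/express-map
        </VirtualHost>

For platforms that need links rewritten in the page body, the mod_proxy_html 
part would typically start from something like this, reusing the placeholder 
vendor names from the redirect example.  Real platforms generally need more 
than one map line:

        # Rewrite absolute vendor links in HTML so they stay on the proxy
        ProxyHTMLEnable On
        ProxyHTMLURLMap http://vendor/ http://vendor.fqdn/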

So what’s currently missing in the Apache HTTPd solution?

- Services that use an authentication token (predominantly ebook vendors) need 
special support written.  I have been entertaining using mod_lua for this to 
make this support relatively easy for someone who is not hard-core technical to 
maintain.

- Services that are not IP authenticated, but use one of the form-based 
authentication variants, also need special handling.  I suspect that injecting 
a script tag into the page, pointing to JavaScript that handles the form 
fill/submission, might be a sane approach here.  This should also cleanly deal 
with the ASP.NET abominations that use __VIEWSTATE to store sessions 
client-side instead of server-side.

- EZproxy’s built-in DNS server (enabled with the “DNS” directive) would need 
to be handled using a separate DNS server (there are several options to choose 
from).

- In this setup, standard systems-level management and reporting tools would be 
used instead of the /admin interface in EZproxy.

- In this setup, the functionality of the EZproxy /menu URL would need to be 
handled externally.  This may not be a real issue, as many academic sites 
already use LMS or portal systems instead of EZproxy to direct students to 
resources, so this feature may not be as critical to replicate.

- And of course, extensive testing.  While the above ProQuest stanza works for 
the main ProQuest search interface, it won’t work for everyone, everywhere just 
yet.

Bottom line: Yes, Apache HTTPd is a viable EZproxy alternative if you have a 
system administrator who knows their way around Apache HTTPd, and are willing 
to spend some time getting to know your vendor services intimately.

All of this testing was done on Fedora 19 for the 2.4 version of HTTPd, which 
should be available in RHEL7/CentOS7 soon, so about the time that hard 
decisions are to be made regarding EZproxy vs something else, that something 
else may very well be Apache HTTPd with vendor-specific configuration files.



--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
