Dear Samuele,

FYI, with the attached patch for oai_harvest_getter.py I was able to successfully harvest the CORDIS data from openaire.

The patch reads the environment variable 'http_proxy'.
If it is set and non-empty, the variable 'script' is replaced with 'http://' + server + script
and the variable 'server' is replaced with the content of 'http_proxy', with any leading 'http://' and surrounding slashes removed.
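
For readers who want to try the rewriting step outside Invenio, here is a minimal sketch of the same logic as a standalone function. The function name route_via_proxy and the example host names are hypothetical and not part of the patch; only the string handling mirrors what the patch does.

```python
def route_via_proxy(server, script, proxy):
    """Return (connect_host, request_target) for a plain-HTTP request.

    server: OAI host, optionally with ':port' (e.g. 'oai.example.org')
    script: path on that host (e.g. '/oai2d')
    proxy:  value of the http_proxy variable, or None/'' if unset
    """
    if not proxy:
        # No proxy configured: connect directly, request the bare path.
        return server, script
    if proxy.startswith('http://'):
        # The connection constructor wants 'host:port', not a URL.
        proxy = proxy[len('http://'):]
    proxy = proxy.strip('/ ')
    if not proxy:
        # The variable held only 'http://' or slashes: treat as unset.
        return server, script
    # Connect to the proxy and ask it for the absolute URL.
    return proxy, 'http://' + server + script
```

The returned pair would then be used as the host argument of the HTTP connection and as the request target, respectively. As in the patch, this covers plain HTTP only.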

I did not add any logic concerning HTTPSConnection and did not test it for https.

Best wishes,
Stefan

On Thu, 19 Jul 2012, Samuele Kaplun wrote:

Dear Stefan,

On Wednesday, 18 July 2012 18:26:27, Stefan Hesselbach wrote:
From our Invenio server we can only access the Internet via a http proxy.
For most purposes this can be done using environment variables http_proxy,
HTTP_PROXY, ...
However, for OAI Harvest this seems not to work. Looking into
oai_harvest_getter.py I found that there httplib is used in the function
OAI_Request(). And searching a bit for use of proxies together with
httplib I found that httplib does not respect the environment variables
like http_proxy (as opposed to urllib). Is this true?

The way to use a proxy with httplib seems to be:
conn = httplib.HTTPConnection("my.proxy.host", "proxy_port")
conn.request("GET", "http://OAI.source.host:port/path")
instead of
conn = httplib.HTTPConnection("OAI.source.host", "port")
conn.request("GET", "/path")

Do you know an easy way to get OAI Harvest to work behind a proxy, i.e.
without hacking oai_harvest_getter.py?

I suspect the only way is really to extend oai_harvest_getter.py in the way
you suggest. We can either have Invenio look for HTTP_PROXY or move to use
urllib(2) instead of httplib.
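
As a rough illustration of the second option, the sketch below shows how the urllib family picks up proxy settings from the environment. It is written for Python 3 (urllib.request), whereas the code discussed in this thread is Python 2, where urllib2 offers an analogous ProxyHandler. The proxy and OAI host names are hypothetical, and this is not the patch that was eventually proposed.

```python
import os
import urllib.request

# Hypothetical proxy, set here only so the example is self-contained;
# normally the variable would already be present in the environment.
os.environ['http_proxy'] = 'http://proxy.example.org:3128'

# Collect the *_proxy variables from the environment as a dict
# mapping scheme to proxy URL.
proxies = urllib.request.getproxies_environment()

# Build an opener that routes requests through those proxies.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))

# A harvest request would then look like this (not executed here):
# response = opener.open('http://oai.example.org/oai2d?verb=Identify')
```

Passing the dictionary explicitly is a choice made for clarity; a ProxyHandler created without arguments reads the environment on its own.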

I'll ticketize this. Thanks for reporting it. We'll propose a patch ASAP.

Cheers!
        Samuele



--
Stefan Hesselbach
Digital Library & Content Management
Informationstechnologie

Tel.:   +49-6159-71-1787
E-Mail: [email protected]

GSI Helmholtzzentrum fuer Schwerionenforschung GmbH
Planckstr. 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschraenkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschaeftsfuehrung: Professor Dr. Dr. h.c. mult. Horst Stoecker,
Peter Hassenbach, Dr. Hartmut Eickhoff

Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
--- oai_harvest_getter_orig.py	2012-07-19 14:24:47.000000000 +0200
+++ oai_harvest_getter_gsi.py	2012-07-20 11:04:01.000000000 +0200
@@ -33,6 +33,7 @@
     import re
     import time
     import base64
+    import os
 except ImportError, e:
     print "Error: %s" % e
     sys.exit(1)
@@ -267,6 +268,15 @@
                "From": CFG_SITE_ADMIN_EMAIL,
                "User-Agent":"Invenio %s" % CFG_VERSION}
 
+    proxy = os.getenv('http_proxy')
+    if proxy:
+        if proxy.startswith('http://'):
+            proxy = proxy[7:]
+        proxy = proxy.strip('/ ')
+        if len(proxy) > 0:
+            script = 'http://' + server + script
+            server = proxy
+
     if password:
         # We use basic authentication
         headers["Authorization"] = "Basic " + base64.encodestring(user + ":" + password).strip()
