Dear Samuele,
FYI, with the attached patch for oai_harvest_getter.py I was able to
successfully harvest the CORDIS data from openaire.
It reads the environment variable 'http_proxy'.
If this exists it replaces the variable 'script' with 'server + script'
and the variable 'server' with the content of 'http_proxy',
removing/adding 'http://' at the respective places.
I did not add any logic concerning HTTPSConnection and did not test it for
https.
Best wishes,
Stefan
On Thu, 19 Jul 2012, Samuele Kaplun wrote:
Dear Stefan,
On Wednesday, 18 July 2012, 18:26:27, Stefan Hesselbach wrote:
From our Invenio server we can only access the Internet via a http proxy.
For most purposes this can be done using environment variables http_proxy,
HTTP_PROXY, ...
However, for OAI Harvest this seems not to work. Looking into
oai_harvest_getter.py I found that there httplib is used in the function
OAI_Request(). And searching a bit for use of proxies together with
httplib I found that httplib does not respect the environment variables
like http_proxy (as opposed to urllib). Is this true?
The way to use a proxy with httplib seems to be:
conn = httplib.HTTPConnection("my.proxy.host", "proxy_port")
conn.request("GET", "http://OAI.source.host:port/path")
instead of
conn = httplib.HTTPConnection("OAI.source.host", "port")
conn.request("GET", "/path")
Do you know an easy way to get OAI Harvest to work behind a proxy, i.e.
without hacking oai_harvest_getter.py?
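
A minimal sketch of the two call patterns above, written against Python 3, where httplib is named http.client. The helper name open_connection and the example.org hosts are made up for illustration and are not part of Invenio. The HTTPConnection constructor does not open a socket, so the sketch can be inspected without network access.

```python
import http.client  # Python 3 name of the httplib module


def open_connection(target_host, target_port, path,
                    proxy_host=None, proxy_port=None):
    """Return (connection, request_target) without sending anything.

    Hypothetical helper: with a proxy, the TCP connection goes to the
    proxy and the request line carries the absolute URL of the target;
    without one, the connection goes to the target and the request line
    carries only the path.
    """
    if proxy_host:
        conn = http.client.HTTPConnection(proxy_host, proxy_port)
        request_target = "http://%s:%d%s" % (target_host, target_port, path)
    else:
        conn = http.client.HTTPConnection(target_host, target_port)
        request_target = path
    return conn, request_target


# Usage (placeholder hosts); conn.request("GET", target) would then send it.
conn, target = open_connection("oai.example.org", 80, "/oai2d",
                               proxy_host="proxy.example.org",
                               proxy_port=3128)
```

Nothing is sent until conn.request() is called, so the only difference between the two cases is which host the connection points at and what goes into the request line.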
I suspect the only way is really to extend oai_harvest_getter.py in the way
you suggest. We can either have Invenio look for HTTP_PROXY or move to use
urllib(2) instead of httplib.
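
For comparison, a rough sketch of the urllib route, again in Python 3, where urllib2's functionality lives in urllib.request. The function names detected_proxies and fetch are hypothetical, and this is not the code that went into Invenio. It only illustrates that urllib picks up http_proxy from the environment without extra logic.

```python
import urllib.request


def detected_proxies():
    """Return the proxy mapping urllib derives from the environment
    (for example {'http': 'http://proxy.example.org:3128'})."""
    return urllib.request.getproxies()


def fetch(url, timeout=60):
    """Fetch a URL with urllib's default opener.

    The default opener includes a ProxyHandler, so a proxy configured
    through http_proxy is used automatically.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

With http_proxy set in the environment, detected_proxies() reports it under the 'http' key and fetch() routes plain HTTP requests through it.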
I'll ticketize this. Thanks for reporting it. We'll propose a patch ASAP.
Cheers!
Samuele
--
Stefan Hesselbach
Digital Library & Content Management
Informationstechnologie
Tel.: +49-6159-71-1787
E-Mail: [email protected]
GSI Helmholtzzentrum fuer Schwerionenforschung GmbH
Planckstr. 1
D-64291 Darmstadt
www.gsi.de
--- oai_harvest_getter_orig.py 2012-07-19 14:24:47.000000000 +0200
+++ oai_harvest_getter_gsi.py 2012-07-20 11:04:01.000000000 +0200
@@ -33,6 +33,7 @@
     import re
     import time
     import base64
+    import os
 except ImportError, e:
     print "Error: %s" % e
     sys.exit(1)
@@ -267,6 +268,15 @@
                "From": CFG_SITE_ADMIN_EMAIL,
                "User-Agent":"Invenio %s" % CFG_VERSION}

+    proxy = os.getenv('http_proxy')
+    if proxy:
+        if proxy.startswith('http://'):
+            proxy = proxy[7:]
+        proxy = proxy.strip('/ ')
+        if len(proxy) > 0:
+            script = 'http://' + server + script
+            server = proxy
+
     if password:
         # We use basic authentication
         headers["Authorization"] = "Basic " + base64.encodestring(user + ":" + password).strip()