New submission from Hynek Petrak <hynek.pet...@gmail.com>:
Hi, I wrote a web crawler that uses ThreadPoolExecutor to spawn multiple
worker threads, retrieves the content of web pages via http.client, and
saves it to a file. After a couple of thousand requests have been
processed, the crawler starts to consume memory rapidly, eventually
exhausting all available memory. tracemalloc shows the memory is not
collected from:

/usr/lib/python3.9/http/client.py:468: size=47.6 MiB, count=6078, average=8221 B
  File "/usr/lib/python3.9/http/client.py", line 468
    s = self.fp.read()

I have tested with requests and urllib3 as well, and since both use
http.client underneath, the result is always the same. My code around
that is below (ctx, headers and log are defined elsewhere in the
crawler); a minimal driver sketch for reproducing the growth follows at
the end of this message.

import http.client
from urllib.parse import urlparse

def get_html3(session, url, timeout=10):
    o = urlparse(url)
    if o.scheme == 'http':
        cn = http.client.HTTPConnection(o.netloc, timeout=timeout)
    else:
        cn = http.client.HTTPSConnection(o.netloc, context=ctx, timeout=timeout)
    cn.request('GET', o.path, headers=headers)
    r = cn.getresponse()
    log.debug(f'[*] [{url}] Status: {r.status} {r.reason}')
    if r.status not in [400, 403, 404]:
        ret = r.read().decode('utf-8')
    else:
        ret = ""
    r.close()
    del r
    cn.close()
    del cn
    return ret

----------
messages: 390287
nosy: HynekPetrak
priority: normal
severity: normal
status: open
title: http.client leaks from self.fp.read()
type: crash
versions: Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43741>
_______________________________________
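A minimal driver along these lines reproduces what the crawler does. It
is a sketch, not the exact crawler code: the worker count, the URL
list, and the fetch() wrapper are placeholders, session is unused by
get_html3() so None is passed, and it assumes get_html3() and the ctx,
headers and log objects it relies on are defined as above.

import tracemalloc
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Swallow per-URL errors (timeouts, DNS failures) so one bad URL
    # does not abort the whole crawl.
    try:
        return get_html3(None, url)
    except Exception as e:
        log.debug(f'[!] [{url}] {e}')
        return ""

def crawl(urls, workers=16):
    # Trace allocations so the leaking call site can be identified.
    tracemalloc.start()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(fetch, urls):
            pass
    # Report the top allocation sites after the crawl finishes.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:5]:
        print(stat)

After a few thousand URLs, the first statistic printed should match the
http/client.py:468 entry (s = self.fp.read()) quoted above.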