New submission from Hynek Petrak <hynek.pet...@gmail.com>:
Hi, I wrote a web crawler that uses ThreadPoolExecutor to spawn multiple
worker threads, retrieves the content of web pages via http.client, and
saves it to a file. After a couple of thousand requests have been
processed, the crawler starts to consume memory rapidly, eventually
exhausting all available memory. tracemalloc shows the memory is not
collected from:

/usr/lib/python3.9/http/client.py:468: size=47.6 MiB, count=6078, average=8221 B
  File "/usr/lib/python3.9/http/client.py", line 468
    s = self.fp.read()

I have tested with requests and urllib3 as well, and since both use
http.client underneath, the result is always the same. My code around
that is below (ctx, headers and log are defined elsewhere in the
crawler); a minimal driver sketch for reproducing the growth follows at
the end of this message.

import http.client
from urllib.parse import urlparse

def get_html3(session, url, timeout=10):
    o = urlparse(url)
    if o.scheme == 'http':
        cn = http.client.HTTPConnection(o.netloc, timeout=timeout)
    else:
        cn = http.client.HTTPSConnection(o.netloc, context=ctx, timeout=timeout)
    cn.request('GET', o.path, headers=headers)
    r = cn.getresponse()
    log.debug(f'[*] [{url}] Status: {r.status} {r.reason}')
    if r.status not in [400, 403, 404]:
        ret = r.read().decode('utf-8')
    else:
        ret = ""
    r.close()
    del r
    cn.close()
    del cn
    return ret

----------
messages: 390287
nosy: HynekPetrak
priority: normal
severity: normal
status: open
title: http.client leaks from self.fp.read()
type: crash
versions: Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43741>
_______________________________________
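A minimal driver along these lines reproduces what the crawler does. It
is a sketch, not the exact crawler code: the worker count, the URL
list, and the fetch() wrapper are placeholders, session is unused by
get_html3() so None is passed, and it assumes get_html3() and the ctx,
headers and log objects it relies on are defined as above.

import tracemalloc
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Swallow per-URL errors (timeouts, DNS failures) so one bad URL
    # does not abort the whole crawl.
    try:
        return get_html3(None, url)
    except Exception as e:
        log.debug(f'[!] [{url}] {e}')
        return ""

def crawl(urls, workers=16):
    # Trace allocations so the leaking call site can be identified.
    tracemalloc.start()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(fetch, urls):
            pass
    # Report the top allocation sites after the crawl finishes.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:5]:
        print(stat)

After a few thousand URLs, the first statistic printed should match the
http/client.py:468 entry (s = self.fp.read()) quoted above.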