Hello, everyone!

I am encountering an issue with my Python code that opens multiple HTML
pages and extracts elements from a specific class, if they exist. The code
runs fine for a few iterations but crashes after enough iterations, always
on the same HTML page. Interestingly, if I process this page individually,
the code works without any problems.

This issue has been bothering me for some time, and I have tried many
approaches.

When debugging, I noticed that the code always crashes within the *lxml*
package, which is used by *requests_html*. This could be a bug in the
library, but what could explain the crash only after several iterations?

Some additional information:

Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree          : (5, 2, 2, 0)
libxml used         : (2, 12, 6)
libxml compiled     : (2, 12, 6)
libxslt used        : (1, 1, 39)
libxslt compiled    : (1, 1, 39)

I'm attaching the code I mentioned. You can see that the code is quite
simple.

import pandas as pd
import requests_html

def main():
    df_links = pd.read_csv('./links.csv')

    session = requests_html.HTMLSession()

    for i in range(0, len(df_links.index)):
        url = df_links.iloc[i]['hyperlink']
        print(f"[{i}/{len(df_links.index)}]: {url}", flush=True)
        try:
            response = session.get(url)
            if response.status_code == 200:
                response_html = response.html
                dateList = response_html.find('relative-time')
        except Exception as e:
            print(f"Something went wrong: {e}", flush=True)

if __name__ == "__main__":
    main()

The crash always happens at the following line:

value = etree.fromstring(html, parser, **kw)

in the function document_fromstring in *lxml/html/__init__.py*.
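
If the crash is a segmentation fault inside native code, the
"except Exception" block in my loop will never see it. A minimal sketch of
how the Python-level traceback could still be captured, using the standard
library's faulthandler module, is below. The helper name enable_crash_log
and the file name crash_traceback.log are only placeholders I picked for
illustration, not something from lxml or requests_html.

```python
import faulthandler


def enable_crash_log(path="crash_traceback.log"):
    """Write Python tracebacks to `path` if the process hits a fatal signal.

    The helper name and default path are hypothetical. The file object must
    stay open for as long as the fault handler is enabled, so it is returned
    to the caller instead of being closed here.
    """
    log_file = open(path, "w")
    # all_threads=True dumps the traceback of every thread, not only the
    # one that received the signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Enable once at startup, before the loop over the links begins.
crash_log = enable_crash_log()
```

Calling this at the top of main() should leave a traceback in the log file
showing which Python frame was active when the interpreter died. The same
handler can also be switched on without code changes by running the script
with "python -X faulthandler", which writes to stderr instead.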


Thank you for your time and help.


Best regards,
Kadu
