Re: Trimming X/HTML files

Thomas SMETS Sun, 31 Jul 2005 08:25:42 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The regular expression to remove scripts from an HTML/XHTML file is simple
enough, but it raises a major performance issue.

The following regular expression :
        r'(<script(\s*\S+\s*)+</script>)'
takes ages to complete in Python on a simple HTML file: more than 3 minutes
of CPU time on a 150-line HTML file. In Jython it never completes and
instead raises a painful RunTimeException: maximum number of ??? reached.
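
A likely cause of the slowdown is the nested quantifier in (\s*\S+\s*)+ :
the same run of text can be split between the inner and outer repetition in
a huge number of ways, and the engine tries them all before giving up. As a
minimal sketch (not a full HTML parser), a non-greedy pattern avoids the
nesting. The names SCRIPT_RE and strip_scripts are made up for this example,
and it assumes every script block has a closing tag.

```python
import re

# Match from an opening <script ...> tag to the nearest closing tag.
# ".*?" is non-greedy, and DOTALL lets "." cross line breaks.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    return SCRIPT_RE.sub('', html)


if __name__ == '__main__':
    sample = '<p>a</p><script type="text/javascript">var x = 1;</script><p>b</p>'
    print(strip_scripts(sample))
```

Non-greedy qualifiers have been part of the re syntax for a long time, so
this form should not depend on 2.3-only features, though that is untested on
Jython here.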

Is the only way out to work with plain strings and "match" instead of a
regular expression?
Moreover, Jython is not yet 2.3 compliant, hence the advanced features of
2.3 regular expressions are not yet available!
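
For comparison, the plain-string route could look roughly like the sketch
below. It uses only str.find and slicing, so it should not depend on any
particular regular expression engine. The function name strip_scripts_plain
is hypothetical, and the code assumes that lowercasing the document does not
change its length (true for ASCII markup).

```python
def strip_scripts_plain(html):
    """Remove <script>...</script> blocks using only string searches."""
    lower = html.lower()  # search copy, so tag case does not matter
    parts = []
    pos = 0
    while True:
        start = lower.find('<script', pos)
        if start == -1:
            parts.append(html[pos:])  # no more scripts: keep the tail
            break
        parts.append(html[pos:start])  # keep the text before the script
        end = lower.find('</script', start)
        if end == -1:
            break  # unterminated script: drop the rest of the document
        close = lower.find('>', end)
        if close == -1:
            break  # closing tag is cut off: drop the rest as well
        pos = close + 1
    return ''.join(parts)


if __name__ == '__main__':
    print(strip_scripts_plain('<p>a</p><SCRIPT>x</SCRIPT><p>b</p>'))
```

Each character is visited a bounded number of times, so the running time
grows linearly with the size of the file.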

\T,




Thomas SMETS wrote:
|
| Dear,
|
| I need to parse XHTML/HTML files in all ways :
|   - Removing comments and JavaScript is the first issue
|   - Retrieving the list of fields to submit is my next item (todo)
|
| Any idea where I could find this already made ... ?
|
| \T,
|
|
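
For the second item in the quoted message (listing the fields to submit),
one possible starting point is the standard library's HTMLParser. The sketch
below is written for Python 3, where the class lives in html.parser; the
names FieldCollector and list_fields are invented for this example. It only
collects the name attribute of input, select and textarea tags, and the
parser hands comments to a separate callback, so commented-out fields are
skipped.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Collect the name attribute of form controls."""

    FIELD_TAGS = ('input', 'select', 'textarea')

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs
        if tag in self.FIELD_TAGS:
            name = dict(attrs).get('name')
            if name is not None:
                self.fields.append(name)


def list_fields(html):
    """Return the names of the form fields found in html, in order."""
    collector = FieldCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


if __name__ == '__main__':
    print(list_fields('<form><input name="q"/><input type="submit"></form>'))
```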

- --
Thomas SMETS
Bruxelles
@ : [EMAIL PROTECTED]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFC7OkTqN0SJr+xLBURAuTYAKDLxLv+hpnSrZ6uowOmUczVxgxLqwCYhfJ3
fwjPZzg88gh3lNY8jkG3SA==
=urIC
-----END PGP SIGNATURE-----
-- 
http://mail.python.org/mailman/listinfo/python-list
