On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.

Oy! If I had a nickel for every misguided coder who tried to scrape HTML
with regexes...

Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/' before
  the closing '>', as in "<BR/>"
- tags can have attributes that you did not expect (like "<BR CLEAR=ALL>")
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or
  even (surprise!) NO quotes

For HTML that is machine-generated, you *may* be able to make some
page-specific assumptions. But if edited by human hands, or if you are
trying to make a generic page scraper, RE's will never cut it.

-- Paul
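
To make the list above concrete, here is a rough sketch (not the only way to do it) that runs a deliberately naive pattern and the standard library's html.parser over the same handful of tag spellings. The SAMPLES list, the NAIVE_BR pattern, and the TagCollector and parse_tags names are made up for illustration; only HTMLParser and re come from the standard library.

```python
import re
from html.parser import HTMLParser

# Hypothetical samples: the same empty tag spelled several valid ways.
SAMPLES = [
    '<br>',
    '<BR/>',
    '<br />',
    '<BR CLEAR=ALL>',
    "<Br clear='all'>",
]

# A deliberately naive pattern of the kind people tend to start with.
NAIVE_BR = re.compile(r'<br>')


class TagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the
        # quotes (if any) from attribute values before calling this.
        # Self-closing tags such as <BR/> also end up here by default.
        self.tags.append((tag, dict(attrs)))


def parse_tags(html):
    """Return a list of (tag, attribute dict) pairs found in html."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


if __name__ == '__main__':
    for sample in SAMPLES:
        matched = bool(NAIVE_BR.search(sample))
        print(repr(sample), 'regex match:', matched,
              'parser:', parse_tags(sample))
```

The naive pattern only catches the first spelling, while the parser reports a 'br' tag for every one of them and normalizes attribute case and quoting. You could keep piling alternations and flags onto the RE, but each item in the list above needs its own patch, which is the point.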