Re: Regular expression to structure HTML

Paul McGuire Fri, 02 Oct 2009 02:01:22 -0700

On Oct 2, 12:10 am, "[email protected]" <[email protected]> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.


Oy! If I had a nickel for every misguided coder who tried to scrape
HTML with regexes...

Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/'
before the closing '>', as in "<BR/>"
- tags can have attributes that you did not expect (like "<BR
CLEAR=ALL>")
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or
even (surprise!) NO quotes

For HTML that is machine-generated, you *may* be able to make some
page-specific assumptions.  But if edited by human hands, or if you
are trying to make a generic page scraper, RE's will never cut it.

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regular expression to structure HTML

Reply via email to