Re: Use regular expression to retrieve all image tags from a given content

Tim Chase Tue, 03 Jul 2012 18:02:14 -0700

On 07/03/12 13:55, Melvyn Sopacua wrote:
> On 3-7-2012 20:38, Tim Chase wrote:
>>   <  img ... >
> 
> Which should fail.


It depends on what the OP is using it for.  If it's just for
extraction of images on the page to list them out, and such a tag
comes through, then the OP may be willing to let such
peculiarly-formed tags slide.  However, if the eventual purpose is
to prevent users from adding image tags to a text-area, then such a
tag has to be caught too, since it's perfectly valid (or at least
widely accepted by multiple browsers).

>> Also, depending on the use-case (such as stripping them out of
>> validated code), a use-case such as
>>
>>   <i<img>mg src="evil.gif">
>>
>> could get part stripped out and leave the evil <img> tag in the text.
> 
> r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
> cases.

And yet I keep coming up with those corner cases, as would any
attacker that wanted to inject an image into the page (again, if the
goal is preventing image injection).  We could still have tags like

  < img class="spoon" src="evil.gif">

and since the HTML spec is pretty lazy/sloppy about unknown
attributes (they are simply ignored), one can even have garbage
attributes introduced anywhere you want:

  <img micturations=plurdled src="evil.gif" gruntbuggly=freddled >
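
To make those corner cases concrete, here is a minimal sketch using
Python's re module.  The first pattern is the one quoted above; the
second (naive_img) and all the variable names are hypothetical,
made up purely for illustration:

```python
import re

# The pattern quoted earlier in this thread.
quoted = re.compile(r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>')

# Hypothetical naive stripper, for illustration only.
naive_img = re.compile(r'<img[^>]*>', re.I)

spaced = '< img class="spoon" src="evil.gif">'
garbage = '<img micturations=plurdled src="evil.gif" gruntbuggly=freddled >'
nested = '<i<img>mg src="evil.gif">'

# The leading space after "<" is enough to dodge the quoted pattern.
missed_spaced = quoted.search(spaced) is None

# The garbage attributes alone do not dodge it.
caught_garbage = quoted.search(garbage) is not None

# A single stripping pass removes the inner "<img>" and, in doing so,
# assembles a brand new image tag from the leftovers.
after_strip = naive_img.sub('', nested)
print(missed_spaced, caught_garbage, after_strip)
```

The last case is the nasty one for sanitizing: the text that comes
out of the stripper is more dangerous than the text that went in.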

> The point is that if you want nothing but the tags (stripped or
> matched), regular expressions can do the job just fine.

The level of concern varies radically depending on whether one just
wants to extract/gather the sane(ish) image tags from a source, or
if the purpose is to sanitize input.  A rough approximation for
extraction could look something like the following untested regexp:

 import re

 r = re.compile(r"""
    <            # tag opening
    \s*          # optional whitespace
    img          # the tag
    \b           # must end here
    (?:          # one of these things:
     \s+          # whitespace
     (?:[a-z][a-z0-9]+:)? # an optional namespace
     src          # a "src" attribute
     \s*          # optional whitespace
     =            # the equals sign
     \s*          # optional whitespace
     (            # capture the value
      "[^"]*"     # a double-quoted string
      |           # or
      '[^']*'     # a single-quoted string
      |           # or
      [-a-z0-9._:]+   # an unquoted value, per HTML spec
      )           # end of the captured src
    |            # or something that's not src
     \s+          # whitespace
     (?:[a-z][a-z0-9]+:)? # an optional namespace
     [a-z0-9]+    # the attribute name
     (?:          # an optional value
      \s*         # optional whitespace
      =           # an assignment
      \s*         # optional whitespace
      (?:          # the value
       "[^"]*"     # a double-quoted string
       |           # or
       '[^']*'     # a single-quoted string
       |           # or
       [-a-z0-9._:]+   # an unquoted value, per HTML spec
       )           # end of the value
      )?          # end of the optional value
     )*          # zero or more attributes
    \s*          # optional whitespace
    (?:/\s*)?    # an optional self-closing slash
    >            # closing >
    """, re.I | re.VERBOSE)

So, that said, I'm not sure it's much more complex to use a real
parsing library :-)
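
For comparison, a minimal sketch of the parsing-library route using
only the standard library's html.parser.  The class and function
names (ImgSrcCollector, image_sources) are hypothetical, and this
only covers the extraction use case; it is not a sanitizer:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src value of every <img> start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased,
        # and attribute values with their surrounding quotes removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def image_sources(html):
    """Return the src values of the <img> tags found in html, in order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


sample = ('<p><IMG class="spoon" SRC="a.gif">'
          "<img micturations=plurdled src='b.png' gruntbuggly=freddled />"
          '</p>')
print(image_sources(sample))
```

Self-closing tags end up in handle_starttag as well, because the
default handle_startendtag calls it, so both forms are collected.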

> It's a trade-off you should make a decision on, not just blatantly
> dismiss regular expressions when a document contains tags or call them
> complex when they contain more than two characters.

I'm not blithely dismissing them, just making sure that

1) the use case is understood (I haven't heard the OP chime in
beyond the simple code snippet in the first post)

2) the complexity of adequately catching the wide variety of
edge-cases one can encounter when using regexps is appreciated, and

3) parsing HTML/XML is often better left to battle-tested libraries,
as I'm sure there are missing bits in the above regex...things like
the actual allowable character-sets for attribute names, and there
might yet be some sociopathic code that could slip by, but this
catches most of the cases that occur from my understanding of the
HTML spec.

> The call can even be swayed in favor for either by the "I want to
> learn (regex|XML parsing)" argument.

THAT, I can't give you any grief about--I frequently use that
argument myself :-D

-tim





-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.
