Re: Use regular expression to retrieve all image tags from a given content

2012-07-04 Thread Tim Chase

On 07/04/12 08:30, Melvyn Sopacua wrote:
> On 4-7-2012 3:03, Tim Chase wrote:
>>  [snip Tim's obscene regex]
> 
> Aside from the \b matching positive against ><,

I'm not sure I follow...the \b just requires that a word-boundary
occur there, preventing it from matching something like "imgood".
It might even be vestigial as I originally had "*" rather than "+"
for the whitespace before attributes, so it could likely be removed
now without impacting the regexp.
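
To make the role of the \b concrete, here is a small sketch. The two
stripped-down patterns and the sample strings are made up for
illustration; they are not the full regexp from the thread.

```python
import re

# Hypothetical, stripped-down patterns that only look at the tag name.
with_boundary = re.compile(r"<\s*img\b", re.I)
without_boundary = re.compile(r"<\s*img", re.I)

samples = ['<img src="a.gif">', "<imgood>", "<IMG>", "< img/>"]

for text in samples:
    # Print whether each pattern accepts the start of the sample.
    print(text, bool(with_boundary.match(text)), bool(without_boundary.match(text)))
```

Only "<imgood>" is treated differently: the version with \b rejects it,
the version without \b accepts it.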

> My main beef with modern software is that for the simplest of things one
> flees to full-blown libraries which happen to provide some utilities,
> but the other 98% of the code from that library is unused. Case in
> point, PIL to verify if a file is an image. But my rant alarm went off. :)

I've done enough work where pathological client/user data comes
through that I swear sometimes they're TRYING to break the app.
Like the person who uploaded an Excel file with a .jpg extension to
try and post the contained graph (re. your PIL image-detection comment).
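
For the cheap end of that trade-off, a rough sketch of a header check
without any imaging library might look like the following. The function
name, the signature table, and the sample byte strings are assumptions
for illustration. It only inspects the first few bytes, so it would
catch the renamed spreadsheet but says nothing about whether the rest
of the file is a valid image.

```python
# Leading "magic" bytes for a few common image formats (not exhaustive).
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return a format name if data starts with a known signature, else None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


# An old-style .xls file starts with the OLE2 header, whatever it is named.
fake_jpg = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 16
real_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

print(sniff_image_type(fake_jpg))
print(sniff_image_type(real_png))
```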

So it all boils down to the use case, and how much work you want to
spend maintaining it every time it breaks.  If security is involved,
I want the best-tested library I can get so I don't have to make all
their mistakes myself.  If it's just a dirty "get me adequate data
as fast as possible" (especially if it's a one-off for a single
data-source), then I'll often just hammer it out using whatever is
easiest.

> But the author I replied to did.

Ah.  I was looking for a 2nd post in the thread from the OP ("Mo
Mughrabi") and didn't see anything.  At least via gmane where I read
the list.

> I think you and I are on the same page.
> Either way, the OP now has some nice examples of how to /refine/ regular
> expressions and that's the real craft :).

Amen!  :-)

-tim

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.

Re: Use regular expression to retrieve all image tags from a given content

2012-07-04 Thread Melvyn Sopacua

On 4-7-2012 3:03, Tim Chase wrote:

> r = re.compile(r"""
>     <                       # tag opening
>     \s*                     # optional whitespace
>     img                     # the tag
>     \b                      # must end here
>     (?:                     # one of these things:
>      \s+                    # whitespace
>      (?:[a-z][a-z0-9]+:)?   # an optional namespace
>      src                    # a "src" attribute
>      \s*                    # optional whitespace
>      =                      # the equals sign
>      \s*                    # optional whitespace
>      (                      # capture the value
>       "[^"]*"               # a double-quoted string
>       |                     # or
>       '[^']*'               # a single-quoted string
>       |                     # or
>       [-a-z0-9._:]+         # an unquoted value, per HTML spec
>       )                     # end of the captured src
>     |                       # or something that's not src
>      \s+                    # whitespace
>      (?:[a-z][a-z0-9]+:)?   # an optional namespace
>      [a-z0-9]+              # the attribute name
>      (?:                    # an optional value
>       \s*                   # optional whitespace
>       =                     # an assignment
>       \s*                   # optional whitespace
>       (?:                   # the value
>        "[^"]*"              # a double-quoted string
>        |                    # or
>        '[^']*'              # a single-quoted string
>        |                    # or
>        [-a-z0-9._:]+        # an unquoted value, per HTML spec
>        )                    # end of the value
>       )?                    # end of ignored attribute
>      )*                     # zero or more attributes
>     \s*                     # optional whitespace
>     (?:/\s*)?               # an optional self-closing
>     >                       # closing >
>     """, re.I | re.VERBOSE)
> 
> So, that said, I'm not sure it's much more complex to use a real
> parsing library :-)

Aside from the \b matching positive against ><, this is a syntax
validator and of course when validating syntax you'll use a validating
parser. For all other cases, you'd be more interested in "image tags
with a src attribute", making the use-case quite a bit simpler.
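
As a rough sketch of that simpler use-case, a pattern along these lines
would pull out just the src values. The pattern, its name, and the
sample markup are assumptions for illustration, not something posted in
the thread. It would still be fooled by a ">" inside a quoted attribute
that precedes src.

```python
import re

# Hypothetical simplified pattern: an img tag that carries a src attribute.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\ssrc\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""",
    re.I,
)

html = '<p><img src="a.gif"><IMG class="x" SRC=\'b.png\'><img alt="no source"></p>'

# findall returns the captured group, i.e. the src value with its quotes.
print(IMG_SRC.findall(html))
```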

My main beef with modern software is that for the simplest of things one
flees to full-blown libraries which happen to provide some utilities,
but the other 98% of the code from that library is unused. Case in
point, PIL to verify if a file is an image. But my rant alarm went off. :)

>> It's a trade-off you should make a decision on, not just blatantly
>> dismiss regular expressions when a document contains tags or call them
>> complex when they contain more than two characters.
> 
> I'm not blithely dismissing them,

But the author I replied to did. I think you and I are on the same page.
Either way, the OP now has some nice examples of how to /refine/ regular
expressions and that's the real craft :).

-- 
Melvyn Sopacua



Re: Use regular expression to retrieve all image tags from a given content

2012-07-03 Thread Tim Chase

On 07/03/12 13:55, Melvyn Sopacua wrote:
> On 3-7-2012 20:38, Tim Chase wrote:
>>   <  img ... >
> 
> Which should fail.

It depends on what the OP is using it for.  If it's just for
extraction of images on the page to list them out, and such a tag
comes through, then the OP may be willing to let such
peculiarly-formed tags slide.  However, if the eventual purpose is
to prevent users from adding image tags to a text-area, then it's
perfectly valid (or at least widely accepted by multiple browsers).

>> Also, depending on the use-case (such as stripping them out of
>> validated code), a use-case such as
>>
>>   
>>
>> could get part stripped out and leave the evil  tag in the text.
> 
> r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
> cases.

And yet I keep coming up with those corner cases, as would any
attacker that wanted to inject an image into the page (again, if the
goal is preventing image injection).  We could still have tags like

  < img class="spoon" src="evil.gif">

and since the HTML spec is pretty lazy/sloppy regarding ignoring
unknown tags, one can even have garbage attributes introduced
anywhere you want:

  

> The point is that if you want nothing but the tags (stripped or
> matched), regular expressions can do the job just fine.

The level of concern varies radically depending on whether one just
wants to extract/gather the sane(ish) image tags from a source, or
if the purpose is to sanitize input.  A rough estimate for
extraction could be done something like the following untested regexp:

import re

r = re.compile(r"""
    <                       # tag opening
    \s*                     # optional whitespace
    img                     # the tag
    \b                      # must end here
    (?:                     # one of these things:
     \s+                    # whitespace
     (?:[a-z][a-z0-9]+:)?   # an optional namespace
     src                    # a "src" attribute
     \s*                    # optional whitespace
     =                      # the equals sign
     \s*                    # optional whitespace
     (                      # capture the value
      "[^"]*"               # a double-quoted string
      |                     # or
      '[^']*'               # a single-quoted string
      |                     # or
      [-a-z0-9._:]+         # an unquoted value, per HTML spec
      )                     # end of the captured src
    |                       # or something that's not src
     \s+                    # whitespace
     (?:[a-z][a-z0-9]+:)?   # an optional namespace
     [a-z0-9]+              # the attribute name
     (?:                    # an optional value
      \s*                   # optional whitespace
      =                     # an assignment
      \s*                   # optional whitespace
      (?:                   # the value
       "[^"]*"              # a double-quoted string
       |                    # or
       '[^']*'              # a single-quoted string
       |                    # or
       [-a-z0-9._:]+        # an unquoted value, per HTML spec
       )                    # end of the value
      )?                    # end of ignored attribute
     )*                     # zero or more attributes
    \s*                     # optional whitespace
    (?:/\s*)?               # an optional self-closing
    >                       # closing >
    """, re.I | re.VERBOSE)

So, that said, I'm not sure it's much more complex to use a real
parsing library :-)
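
For comparison, a minimal sketch of the parser route using only the
standard library could look like this. It assumes Python 3, where the
module is html.parser (in the Python 2 of this thread it was named
HTMLParser). The class name and the sample markup are made up for
illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag seen."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names already lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)

    # Self-closing tags such as <img ... /> go to handle_startendtag, which
    # by default forwards to handle_starttag, so no override is needed.


collector = ImgSrcCollector()
collector.feed('<p><IMG class="spoon" SRC="evil.gif"><img src=\'plain.png\'/></p>')
print(collector.sources)
```

The parser takes care of case, quoting, and attribute order; the code
that is left only says what to do with an img tag once it is found.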

> It's a trade-off you should make a decision on, not just blatantly
> dismiss regular expressions when a document contains tags or call them
> complex when they contain more than two characters.

I'm not blithely dismissing them, just making sure that

1) the use case is understood (I haven't heard the OP chime in
beyond the simple code snippet in the first post)

2) the complexity of adequately catching the wide variety of
edge-cases one can encounter when using regexps is appreciated, and

3) parsing HTML/XML is often better left to battle-tested libraries,
as I'm sure there are missing bits in the above regex...things like
the actual allowable character-sets for attribute names, and there
might yet be some sociopathic code that could slip by, but this
catches most of the cases that occur from my understanding of the
HTML spec.

> The call can even be swayed in favor of either by the "I want to
> learn (regex|XML parsing)" argument.

THAT, I can't give you any grief about--I frequently use that
argument myself :-D

-tim






Re: Use regular expression to retrieve all image tags from a given content

2012-07-03 Thread Melvyn Sopacua

On 3-7-2012 20:38, Tim Chase wrote:
> On 07/03/12 12:57, Melvyn Sopacua wrote:
>> On 30-6-2012 15:23, Sunny Nanda wrote:
>> What you're looking for is:
>> prog = re.compile(r'')
>> matches = prog.findall(content)
>> for match in matches:
>>     print(match)
>>
>>> On a sidenote, you should not be using regular expressions if you are doing 
>>> anything more complex than what you are doing right now.
>>
>> This isn't complex. The email validator in django is complex. Using an
>> XML parser for this is quite overkill. If you need several elements
>> based on their nesting and/or sister elements, then an XML parser makes
>> more sense, or better xpath queries. This is simple stuff for regular
>> expressions and what they're made for.
> 
> The reason for using a true parser is to avoid obscure edge cases.
> Your example fails on both
> 
>   

Which is easily corrected with either <[Ii][Mm][Gg] or case-insensitive.
> 
> and
> 
>   <  img ... >

Which should fail.

> Also, depending on the use-case (such as stripping them out of
> validated code), a use-case such as
> 
>   
> 
> could get part stripped out and leave the evil  tag in the text.

r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
cases. The point is that if you want nothing but the tags (stripped or
matched), regular expressions can do the job just fine. It's actually
more complex to do this with parsers: you have to deal with syntax
errors, and with SAX-based parsers you have to keep state and rejoin the
tags with their attributes. The only advantageous parser is a DOM tree,
which has a large memory footprint on complex/large documents.
It's a trade-off you should make a decision on, not just blatantly
dismiss regular expressions when a document contains tags or call them
complex when they contain more than two characters. The call can even be
swayed in favor of either by the "I want to learn (regex|XML parsing)"
argument.
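
A quick way to see where the remaining corner cases of this pattern lie
is to run it over a few variants. The sample tags below are made up for
illustration.

```python
import re

pattern = re.compile(r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>')

samples = [
    '<img src="a.gif">',                # plain tag
    '<IMG class="spoon" SRC="a.gif">',  # upper case, extra attribute
    '<img src = "a.gif">',              # whitespace around the equals sign
    '< img src="a.gif">',               # whitespace after the opening <
    '<img alt="a > b" src="a.gif">',    # a ">" inside a quoted value
]

for text in samples:
    match = pattern.search(text)
    # Show the matched tag, or None when the pattern misses the sample.
    print(repr(text), "->", match.group(0) if match else None)
```

The first two samples match; the last three do not.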
-- 
Melvyn Sopacua



Re: Use regular expression to retrieve all image tags from a given content

2012-07-03 Thread Tim Chase

On 07/03/12 12:57, Melvyn Sopacua wrote:
> On 30-6-2012 15:23, Sunny Nanda wrote:
> What you're looking for is:
> prog = re.compile(r'')
> matches = prog.findall(content)
> for match in matches:
>     print(match)
> 
>> On a sidenote, you should not be using regular expressions if you are doing 
>> anything more complex than what you are doing right now.
> 
> This isn't complex. The email validator in django is complex. Using an
> XML parser for this is quite overkill. If you need several elements
> based on their nesting and/or sister elements, then an XML parser makes
> more sense, or better xpath queries. This is simple stuff for regular
> expressions and what they're made for.

The reason for using a true parser is to avoid obscure edge cases.
Your example fails on both

  

and

  <  img ... >

Also, depending on the use-case (such as stripping them out of
validated code), a use-case such as

  

could get part stripped out and leave the evil  tag in the text.
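
As a sketch of this failure mode, consider a naive stripper and a
hostile input. Both are hypothetical and not necessarily the example
meant above; the pattern, function name, and input string are
assumptions for illustration.

```python
import re

# Hypothetical naive sanitizer: delete anything that looks like an img tag.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.I)


def strip_images(text):
    """Remove every img tag the pattern recognises, in a single pass."""
    return IMG_TAG.sub("", text)


# Made-up hostile input: an img tag nested inside the pieces of another tag.
hostile = '<scr<img src="x.gif">ipt>alert(1)</scr<img src="x.gif">ipt>'

print(strip_images(hostile))
```

Removing the img tags joins the surrounding fragments, so the output
contains a tag that was not present as such in the input.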

-tkc




Re: Use regular expression to retrieve all image tags from a given content

2012-07-03 Thread Melvyn Sopacua

On 30-6-2012 15:23, Sunny Nanda wrote:
> You can try the following two suggestions:
> 
> 1. Try removing the "^" from the pattern, given that the image tag might not
> be coming at the start of the string.

That, and re.match is bound to the start of the string. See:
http://docs.python.org/release/2.7.2/library/re.html#search-vs-match

What you're looking for is:
prog = re.compile(r'')
matches = prog.findall(content)
for match in matches:
    print(match)
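
To illustrate the difference between the calls, here is a small sketch.
The pattern and the content string are hypothetical stand-ins, not the
ones from the original post.

```python
import re

# Hypothetical stand-ins for the pattern and the content in question.
prog = re.compile(r"<img\b[^>]*>", re.I)
content = '<p>intro <img src="a.gif"> and <IMG SRC="b.gif"></p>'

first_try = prog.match(content)    # only tries at position 0
first_hit = prog.search(content)   # first match anywhere in the string
all_hits = prog.findall(content)   # every non-overlapping match

print(first_try)
print(first_hit.group(0))
print(all_hits)
```

Because the content starts with "<p>", match() returns None, which is
the symptom described in the first post; search() finds the first tag
and findall() returns both.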

> On a sidenote, you should not be using regular expressions if you are doing 
> anything more complex than what you are doing right now.

This isn't complex. The email validator in django is complex. Using an
XML parser for this is quite overkill. If you need several elements
based on their nesting and/or sister elements, then an XML parser makes
more sense, or better xpath queries. This is simple stuff for regular
expressions and what they're made for.

-- 
Melvyn Sopacua


Re: Use regular expression to retrieve all image tags from a given content

2012-07-01 Thread pankaj anand

You can also use PyQuery for this purpose.

On Saturday, 30 June 2012 18:07:13 UTC+5:30, mo.mughrabi wrote:
>
> Hello, 
>
> I'm really a noob with regular expressions. I tried to do this on my own 
> but I couldn't understand from the manuals how to approach it. I'm trying to 
> find all img tags in a given content. I wrote the code below, but it's 
> returning None
>
>  content = i.content[0].value
>
> prog = re.compile(r'^
> result = prog.match(content)
>
> print result
>
>
>  any suggestions?
>
>


Re: Use regular expression to retrieve all image tags from a given content

2012-06-30 Thread Maksim Schepelin

Why not use an HTML parsing library? BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/), for example:

from bs4 import BeautifulSoup  # BeautifulSoup 4, which provides find_all()
soup = BeautifulSoup('put_your_html_code_as_string', 'html.parser')
images = soup.find_all('img')

If you specifically need regular expressions, watch this video: 
http://www.youtube.com/watch?v=kWyoYtvJpe4

On Saturday, 30 June 2012 20:37:13 UTC+8, mo.mughrabi wrote:
>
> Hello, 
>
> I'm really a noob with regular expressions. I tried to do this on my own 
> but I couldn't understand from the manuals how to approach it. I'm trying to 
> find all img tags in a given content. I wrote the code below, but it's 
> returning None
>
>  content = i.content[0].value
>
> prog = re.compile(r'^
> result = prog.match(content)
>
> print result
>
>
>  any suggestions?
>
>


Re: Use regular expression to retrieve all image tags from a given content

2012-06-30 Thread Sunny Nanda

You can try the following two suggestions:

1. Try removing the "^" from the pattern, given that the image tag might not
be coming at the start of the string.

> Hello, 
>
> I'm really a noob with regular expressions. I tried to do this on my own 
> but I couldn't understand from the manuals how to approach it. I'm trying to 
> find all img tags in a given content. I wrote the code below, but it's 
> returning None
>
>  content = i.content[0].value
>
> prog = re.compile(r'^
> result = prog.match(content)
>
> print result
>
>
>  any suggestions?
>
>

