Re: Use regular expression to retrieve all image tags from a given content
On 07/04/12 08:30, Melvyn Sopacua wrote:
> On 4-7-2012 3:03, Tim Chase wrote:
>> [snip Tim's obscene regex]
>
> Aside from the \b matching positive against ><,

I'm not sure I follow...the \b just requires that a word-boundary occur there, preventing it from matching something like "imgood". It might even be vestigial, as I originally had "*" rather than "+" for the whitespace before attributes, so it could likely be removed now without impacting the regexp.

> My main beef with modern software is that for the simplest of things one
> flees to full-blown libraries which happen to provide some utilities,
> but the other 98% of the code from that library is unused. Case in
> point, PIL to verify if a file is an image. But my rant alarm went off. :)

I've done enough work where pathological client/user data comes through that I swear sometimes they're TRYING to break the app. Like the person who uploaded an Excel file with a .jpg extension to try and post the contained graph (re. your PIL image-detection comment).

So it all boils down to the use case, and how much work you want to spend maintaining it every time it breaks. If security is involved, I want the best-tested library I can get so I don't have to make all their mistakes myself. If it's just a dirty "get me adequate data as fast as possible" (especially if it's a one-off for a single data-source), then I'll often just hammer it out using whatever is easiest.

> But the author I replied to did.

Ah. I was looking for a 2nd post in the thread from the OP ("Mo Mughrabi") and didn't see anything. At least via gmane where I read the list.

> I think you and I are on the same page.
> Either way, the OP now has some nice examples of how to /refine/ regular
> expressions and that's the real craft :).

Amen! :-)

-tim

--
You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to django-users+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/django-users?hl=en.
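To make the \b point concrete, here is a small Python 3 sketch. The sample strings are invented for illustration; it only shows that the word boundary rejects "imgood" while still being satisfied between the "g" and a following space, ">" or "/".

```python
import re

# \b asserts a word boundary: a word character on one side and a
# non-word character (or the edge of the string) on the other.
tag_start = re.compile(r"<\s*img\b", re.I)

# Hypothetical inputs, not taken from the thread.
samples = ["<img src='a.gif'>", "<img>", "<imgood>", "< IMG/>"]

results = {s: bool(tag_start.match(s)) for s in samples}
for sample, ok in results.items():
    print(sample, "->", ok)
```

Only "<imgood>" is rejected, because "g" is followed by another word character there.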
Re: Use regular expression to retrieve all image tags from a given content
On 4-7-2012 3:03, Tim Chase wrote:
> r = re.compile(r"""
>     <                         # tag opening
>     \s*                       # optional whitespace
>     img                       # the tag
>     \b                        # must end here
>     (?:                       # one of these things:
>       \s+                     # whitespace
>       (?:[a-z][a-z0-9]+:)?    # an optional namespace
>       src                     # a "src" attribute
>       \s*                     # optional whitespace
>       =                       # the equals sign
>       \s*                     # optional whitespace
>       (                       # capture the value
>         "[^"]*"               # a double-quoted string
>         |                     # or
>         '[^']*'               # a single-quoted string
>         |                     # or
>         [-a-z0-9._:]+         # unquoted, per HTML spec
>       )                       # end of the captured src
>       |                       # or something that's not src
>       \s+                     # whitespace
>       (?:[a-z][a-z0-9]+:)?    # an optional namespace
>       [a-z0-9]+               # the attribute name
>       (?:                     # an optional value
>         \s*                   # optional whitespace
>         =                     # an assignment
>         \s*                   # optional whitespace
>         (?:                   # the value
>           "[^"]*"             # a double-quoted string
>           |                   # or
>           '[^']*'             # a single-quoted string
>           |                   # or
>           [-a-z0-9._:]+       # unquoted, per HTML spec
>         )                     # end of the value
>       )?                      # end of ignored attribute
>     )*                        # zero or more attributes
>     \s*                       # optional whitespace
>     (?:/\s*)?                 # an optional self-closing
>     >                         # closing >
>     """, re.I | re.VERBOSE)
>
> So, that said, I'm not sure it's much more complex to use a real
> parsing library :-)

Aside from the \b matching positive against ><, this is a syntax validator and of course when validating syntax you'll use a validating parser. For all other cases, you'd be more interested in "image tags with a src attribute", making the use-case quite a bit simpler.

My main beef with modern software is that for the simplest of things one flees to full-blown libraries which happen to provide some utilities, but the other 98% of the code from that library is unused. Case in point, PIL to verify if a file is an image. But my rant alarm went off. :)

>> It's a trade-off you should make a decision on, not just blatantly
>> dismiss regular expressions when a document contains tags or call them
>> complex when they contain more than two characters.
>
> I'm not blithely dismissing them,

But the author I replied to did.
I think you and I are on the same page. Either way, the OP now has some nice examples of how to /refine/ regular expressions and that's the real craft :).

--
Melvyn Sopacua
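For the "image tags with a src attribute" case, a much shorter pattern is often enough. The Python 3 sketch below is an assumption about what such a pattern could look like, not something posted in the thread; `IMG_SRC` and `image_sources` are hypothetical names, and it makes no attempt at validating the markup.

```python
import re

# Look for whitespace + "src" inside an <img ...> tag and capture the
# value, whether double-quoted, single-quoted or unquoted.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.I,
)

def image_sources(html):
    """Return the src value of every <img ...> tag found in html."""
    sources = []
    for m in IMG_SRC.finditer(html):
        # Exactly one of the three groups took part in the match.
        sources.append(next(g for g in m.groups() if g is not None))
    return sources

# Made-up sample markup.
sample = (
    '<p><IMG class="x" src="a.gif">'
    "<img data-src=\"no\" src='b.png' />"
    '<img alt="none"></p>'
)
print(image_sources(sample))
```

Requiring whitespace before "src" keeps attributes such as data-src from being mistaken for it. The pattern can still be fooled, for example by "src=" appearing inside another attribute's quoted value, which is the kind of corner case discussed in this thread.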
Re: Use regular expression to retrieve all image tags from a given content
On 07/03/12 13:55, Melvyn Sopacua wrote:
> On 3-7-2012 20:38, Tim Chase wrote:
>>   < img ... >
>
> Which should fail.

It depends on what the OP is using it for. If it's just for extraction of images on the page to list them out, and such a tag comes through, then the OP may be willing to let such peculiarly-formed tags slide. However, if the eventual purpose is to prevent users from adding image tags to a text-area, then it's perfectly valid (or at least widely accepted by multiple browsers).

>> Also, depending on the use-case (such as stripping them out of
>> validated code), a use-case such as
>>
>>
>>
>> could get part stripped out and leave the evil tag in the text.
>
> r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
> cases.

And yet I keep coming up with those corner cases, as would any attacker that wanted to inject an image into the page (again, if the goal is preventing image injection). We could still have tags like

  < img class="spoon" src="evil.gif">

and since the HTML spec is pretty lazy/sloppy regarding ignoring unknown tags, one can even have garbage attributes introduced anywhere you want:

> The point is that if you want nothing but the tags (stripped or
> matched), regular expressions can do the job just fine.

The level of concern varies radically depending on whether one just wants to extract/gather the sane(ish) image tags from a source, or if the purpose is to sanitize input. A rough estimate for extraction could be done something like the following untested regexp:

r = re.compile(r"""
    <                         # tag opening
    \s*                       # optional whitespace
    img                       # the tag
    \b                        # must end here
    (?:                       # one of these things:
      \s+                     # whitespace
      (?:[a-z][a-z0-9]+:)?    # an optional namespace
      src                     # a "src" attribute
      \s*                     # optional whitespace
      =                       # the equals sign
      \s*                     # optional whitespace
      (                       # capture the value
        "[^"]*"               # a double-quoted string
        |                     # or
        '[^']*'               # a single-quoted string
        |                     # or
        [-a-z0-9._:]+         # unquoted, per HTML spec
      )                       # end of the captured src
      |                       # or something that's not src
      \s+                     # whitespace
      (?:[a-z][a-z0-9]+:)?    # an optional namespace
      [a-z0-9]+               # the attribute name
      (?:                     # an optional value
        \s*                   # optional whitespace
        =                     # an assignment
        \s*                   # optional whitespace
        (?:                   # the value
          "[^"]*"             # a double-quoted string
          |                   # or
          '[^']*'             # a single-quoted string
          |                   # or
          [-a-z0-9._:]+       # unquoted, per HTML spec
        )                     # end of the value
      )?                      # end of ignored attribute
    )*                        # zero or more attributes
    \s*                       # optional whitespace
    (?:/\s*)?                 # an optional self-closing
    >                         # closing >
    """, re.I | re.VERBOSE)

So, that said, I'm not sure it's much more complex to use a real parsing library :-)

> It's a trade-off you should make a decision on, not just blatantly
> dismiss regular expressions when a document contains tags or call them
> complex when they contain more than two characters.

I'm not blithely dismissing them, just making sure that 1) the use case is understood (I haven't heard the OP chime in beyond the simple code snippet in the first post), 2) the complexity of adequately catching the wide variety of edge-cases one can encounter when using regexps is appreciated, and 3) it's recognized that parsing HTML/XML is often better left to battle-tested libraries, as I'm sure there are missing bits in the above regex...things like the actual allowable character-sets for attribute names, and there might yet be some sociopathic code that could slip by, but this catches most of the cases that occur from my understanding of the HTML spec.

> The call can even be swayed in favor of either by the "I want to
> learn (regex|XML parsing)" argument.
THAT, I can't give you any grief about--I frequently use that argument myself :-D

-tim
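The stripping concern can be illustrated with a short Python 3 sketch. The nested input is a made-up example, not necessarily the one Tim had in mind, and `strip_once` and `strip_until_stable` are hypothetical helpers. The pattern is the one Melvyn proposed, written with re.I instead of the bracketed letters.

```python
import re

IMG_TAG = re.compile(r"<img[^>]+src=[^>]+>", re.I)

def strip_once(text):
    """Remove every match in a single pass."""
    return IMG_TAG.sub("", text)

def strip_until_stable(text):
    """Repeat the substitution until nothing more is removed."""
    while True:
        stripped = IMG_TAG.sub("", text)
        if stripped == text:
            return text
        text = stripped

# Hypothetical hostile input: one tag hidden inside another.
evil = '<im<img src="x.gif">g src="evil.gif">'

print(strip_once(evil))          # a complete <img> tag is left behind
print(strip_until_stable(evil))  # nothing is left
```

Even the repeated version is only a sketch: it still misses tags the pattern does not recognise, so for real sanitizing the advice about battle-tested libraries applies.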
Re: Use regular expression to retrieve all image tags from a given content
On 3-7-2012 20:38, Tim Chase wrote:
> On 07/03/12 12:57, Melvyn Sopacua wrote:
>> On 30-6-2012 15:23, Sunny Nanda wrote:
>> What you're looking for is:
>> prog = re.compile(r'')
>> matches = prog.finditer(content)
>> for match in matches:
>>     print(match.group(0))
>>
>>> On a sidenote, you should not be using regular expressions if you are doing
>>> anything more complex than what you are doing right now.
>>
>> This isn't complex. The email validator in django is complex. Using an
>> XML parser for this is quite overkill. If you need several elements
>> based on their nesting and/or sister elements, then an XML parser makes
>> more sense, or better xpath queries. This is simple stuff for regular
>> expressions and what they're made for.
>
> The reason for using a true parser is to avoid obscure edge cases.
> Your example fails on both
>
>

Which is easily corrected with either <[Ii][Mm][Gg] or case-insensitive.

> and
>
>   < img ... >

Which should fail.

> Also, depending on the use-case (such as stripping them out of
> validated code), a use-case such as
>
>
>
> could get part stripped out and leave the evil tag in the text.

r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner cases.

The point is that if you want nothing but the tags (stripped or matched), regular expressions can do the job just fine. It's actually more complex to do this with parsers, as you have to deal with syntax errors, keep state and rejoin the tags with the attributes for SAX-based parsers, and the only advantageous parser is a DOM tree, which has a large memory footprint on complex/large documents.

It's a trade-off you should make a decision on, not just blatantly dismiss regular expressions when a document contains tags or call them complex when they contain more than two characters. The call can even be swayed in favor of either by the "I want to learn (regex|XML parsing)" argument.
--
Melvyn Sopacua
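As a rough illustration of what the pattern in this message does and does not catch, the following Python 3 sketch runs it over a few sample tags. The samples are invented for this example; only the pattern comes from the message.

```python
import re

# Pattern as posted; the bracketed letters make it case-insensitive
# without needing the re.I flag.
pattern = re.compile(r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>')

samples = [
    '<img src="a.gif">',                    # plain tag
    '<IMG CLASS="x" SRC="b.gif">',          # upper case
    '< img class="spoon" src="evil.gif">',  # space after "<"
    '<img src = "c.gif">',                  # spaces around "="
    '<imgood src="d.gif">',                 # not an img tag at all
]

matched = [s for s in samples if pattern.search(s)]
missed = [s for s in samples if not pattern.search(s)]

print("matched:", matched)
print("missed: ", missed)
```

The first two samples match as intended. The third and fourth are missed, and the last one matches even though it is not an img tag, because nothing in the pattern requires the tag name to end after "img".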
Re: Use regular expression to retrieve all image tags from a given content
On 07/03/12 12:57, Melvyn Sopacua wrote:
> On 30-6-2012 15:23, Sunny Nanda wrote:
> What you're looking for is:
> prog = re.compile(r'')
> matches = prog.finditer(content)
> for match in matches:
>     print(match.group(0))
>
>> On a sidenote, you should not be using regular expressions if you are doing
>> anything more complex than what you are doing right now.
>
> This isn't complex. The email validator in django is complex. Using an
> XML parser for this is quite overkill. If you need several elements
> based on their nesting and/or sister elements, then an XML parser makes
> more sense, or better xpath queries. This is simple stuff for regular
> expressions and what they're made for.

The reason for using a true parser is to avoid obscure edge cases. Your example fails on both

and

  < img ... >

Also, depending on the use-case (such as stripping them out of validated code), a use-case such as

could get part stripped out and leave the evil tag in the text.

-tkc
Re: Use regular expression to retrieve all image tags from a given content
On 30-6-2012 15:23, Sunny Nanda wrote:
> You can try the following two suggestions:
>
> 1. Try removing the "^" from the pattern and match only r" that the image tag might not be coming at the start of the string.

That, and re.match is bound to the start of the string. See:
http://docs.python.org/release/2.7.2/library/re.html#search-vs-match

What you're looking for is:

prog = re.compile(r'')
matches = prog.finditer(content)
for match in matches:
    print(match.group(0))

> On a sidenote, you should not be using regular expressions if you are doing
> anything more complex than what you are doing right now.

This isn't complex. The email validator in django is complex. Using an XML parser for this is quite overkill. If you need several elements based on their nesting and/or sister elements, then an XML parser makes more sense, or better xpath queries. This is simple stuff for regular expressions and what they're made for.

--
Melvyn Sopacua
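A minimal Python 3 sketch of the difference between match, search and findall. The pattern `<img[^>]*>` and the sample `content` string are stand-ins chosen for this example, not the ones from the original post.

```python
import re

# Made-up content; the first tag is not at the start of the string.
content = '<p>intro</p><img src="a.gif"><p>text</p><img src="b.gif">'
prog = re.compile(r"<img[^>]*>", re.I)

# match() only succeeds if the pattern matches at position 0.
at_start = prog.match(content)

# search() scans forward and returns the first match anywhere.
first = prog.search(content)

# findall() returns every non-overlapping match as a list of strings.
all_tags = prog.findall(content)

print(at_start)        # None, because content starts with "<p>"
print(first.group(0))
print(all_tags)
```

With a leading "^" in the pattern, search() would behave like match() here, which is consistent with the OP getting None.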
Re: Use regular expression to retrieve all image tags from a given content
You can also use PyQuery for this purpose.

On Saturday, 30 June 2012 18:07:13 UTC+5:30, mo.mughrabi wrote:
>
> Hello,
>
> am really a noob with regular expressions, I tried to do this on my own
> but I couldn't understand from the manuals how to approach it. Am trying to
> find all img tags of a given content, I wrote the below but it's returning
> None
>
> content = i.content[0].value
>
> prog = re.compile(r'^
> result = prog.match(content)
>
> print result
>
>
> any suggestions?
>
>

--
To view this discussion on the web visit https://groups.google.com/d/msg/django-users/-/v0Q58xHQw7AJ.
Re: Use regular expression to retrieve all image tags from a given content
Why not use an HTML parsing library? BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), for example:

    from bs4 import BeautifulSoup  # find_all() is the BeautifulSoup 4 API

    soup = BeautifulSoup('put_your_html_code_as_string', 'html.parser')
    images = soup.find_all('img')

If you specifically need regular expressions, watch this video: http://www.youtube.com/watch?v=kWyoYtvJpe4

On Saturday, 30 June 2012 20:37:13 UTC+8, mo.mughrabi wrote:
>
> Hello,
>
> am really a noob with regular expressions, I tried to do this on my own
> but I couldn't understand from the manuals how to approach it. Am trying to
> find all img tags of a given content, I wrote the below but it's returning
> None
>
> content = i.content[0].value
>
> prog = re.compile(r'^
> result = prog.match(content)
>
> print result
>
>
> any suggestions?
>
>

--
To view this discussion on the web visit https://groups.google.com/d/msg/django-users/-/wAd5IjIA8ngJ.
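If installing a third-party package is not an option, the standard library's html.parser can do a similar job. This is a minimal Python 3 sketch; `ImgCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lower-cased; attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)

collector = ImgCollector()
collector.feed('<p><IMG SRC="a.gif"><img class="x" src=\'b.png\' /></p>')
collector.close()
print(collector.sources)
```

Self-closing tags such as `<img ... />` also reach handle_starttag, because the default handle_startendtag delegates to it.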
Re: Use regular expression to retrieve all image tags from a given content
You can try the following two suggestions:

1. Try removing the "^" from the pattern and match only r"

> Hello,
>
> am really a noob with regular expressions, I tried to do this on my own
> but I couldn't understand from the manuals how to approach it. Am trying to
> find all img tags of a given content, I wrote the below but it's returning
> None
>
> content = i.content[0].value
>
> prog = re.compile(r'^
> result = prog.match(content)
>
> print result
>
>
> any suggestions?
>
>

--
To view this discussion on the web visit https://groups.google.com/d/msg/django-users/-/URj9ESCdOaYJ.