Re: Truncating an html string safely
On Sat, Jun 7, 2008 at 7:24 AM, Matt Feifarek [EMAIL PROTECTED] wrote:
> Oops; replied from the wrong address.
>
> -- Forwarded message --
>
> On Thu, Jun 5, 2008 at 2:36 PM, Ian Bicking [EMAIL PROTECTED] wrote:
>> Well... it's hard to truncate exactly, as there's all that annoying nesting stuff. An untested attempt with lxml:
>
> Exactly. Thanks for the lead. I'm not sure I'm up to the challenge, but if I do get it working, I'll get it back to you, in case it's good enough to be added to lxml (or whatever).
>
> Mike: Seems like if we have the truncate function in webhelpers, a truncate that handles html would be wise... since we're, err, making html, usually, with Pylons. Since the Django code doesn't seem to depend on anything (but some Django cruft, which seems to be frosting really) MAYBE it would be better to start with. But I'll poke around a bit today.

It would be fun to write a SAX handler that permits all tags, and counts all characters. It would stop permitting additional characters once it reached a certain limit.

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups pylons-discuss group.
To post to this group, send email to pylons-discuss@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/pylons-discuss?hl=en
-~--~~~~--~~--~--~---
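The handler -jj describes (let every tag through, count text characters, stop at a limit) might look roughly like the sketch below. It is not code from the thread: it swaps SAX for the standard library's html.parser, since xml.sax would reject feed HTML that is not well-formed XML, and the names TruncatingParser and truncate_html are made up for illustration. It assumes the input has no script or style blocks, and it drops comments and doctypes.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Hypothetical helper: pass tags through, emit at most `limit` text chars."""

    def __init__(self, limit):
        # convert_charrefs=True hands us decoded text, so "&amp;" counts as one.
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.open_tags = []   # stack of elements opened but not yet closed
        self.out = []
        self.done = False     # set once the character budget is exhausted

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied and never pushed.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close anything left open inside this element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        chunk = data[:self.remaining]
        self.out.append(escape(chunk, quote=False))  # re-escape &, <, >
        self.remaining -= len(chunk)
        if len(chunk) < len(data):
            self.done = True


def truncate_html(html, limit):
    """Return `html` cut to `limit` text characters with open tags closed."""
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    # Close whatever is still open, innermost first.
    closing = "".join("</%s>" % tag for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("<div><p>Hello <b>world</b> and more</p></div>", 8))
```

The cut falls mid-word here; trimming back to the last whitespace inside handle_data would be a natural refinement if word boundaries matter.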
Re: Truncating an html string safely
On Jun 12, 5:13 am, Mike Orr [EMAIL PROTECTED] wrote:
> Although again, we have two issues. One is HTML-to-text (essentially lynx-as-a-function). The other is truncating an HTML string while keeping it well-formed (which means not stopping in the middle of a tag and closing any open tags).

Here is another sanitizer (I think from something having to do with Zope):
http://www.koders.com/python/fidFB51F4D2D89CC1397608213E09F11404D9B21059.aspx
Re: Truncating an html string safely
On Jun 12, 5:13 am, Mike Orr [EMAIL PROTECTED] wrote:
> Although again, we have two issues. One is HTML-to-text (essentially lynx-as-a-function). The other is truncating an HTML string while keeping it well-formed (which means not stopping in the middle of a tag and closing any open tags).

Actually, I think we may have four issues...?

1) truncate HTML and end up with well-formed HTML.
2) strip all HTML tags (without an interest in text formatting)
3) html2text (trying to keep text formatting with p, block, etc.)
4) sanitizing HTML (not directly discussed here, but a good implementation of this will be helpful, increase security, and should be able to be extended trivially to provide #2, stripping all HTML tags).
Re: Truncating an html string safely
Oops; replied from the wrong address.

-- Forwarded message --

On Thu, Jun 5, 2008 at 2:36 PM, Ian Bicking [EMAIL PROTECTED] wrote:
> Well... it's hard to truncate exactly, as there's all that annoying nesting stuff. An untested attempt with lxml:

Exactly. Thanks for the lead. I'm not sure I'm up to the challenge, but if I do get it working, I'll get it back to you, in case it's good enough to be added to lxml (or whatever).

Mike: Seems like if we have the truncate function in webhelpers, a truncate that handles html would be wise... since we're, err, making html, usually, with Pylons. Since the Django code doesn't seem to depend on anything (but some Django cruft, which seems to be frosting really) MAYBE it would be better to start with. But I'll poke around a bit today.
Re: Truncating an html string safely
On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:
> I'd like to use something like the truncate feature of webhelpers on html data that's being pulled in from an ATOM feed. If I just use a simple truncate, it might leave some html tags opened (like a div without a /div) which is Bad.
>
> I figured that this was a common-enough task that I'd ask some experts before trying to roll my own solution. It seems like the kind of thing that might be hidden within the standard library somewhere, below my nose, but outside of my ability to discover.
>
> I've found this:
> http://code.djangoproject.com/browser/django/trunk/django/utils/text.py
>
> Looks to be about the right thing, but I'd rather not be dependent on all of Django to do this.
>
> Perhaps some ElementTree or LXML wizard knows a quick hack?
>
> Thanks!

I've had excellent luck stripping HTML with the following:
http://www.aminus.net/browser/cleanhtml.py

I use it to strip out all the html leaving a nice plain string. It does the best job of any solutions I've seen.

TJ
Re: Truncating an html string safely
On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
> On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:
>> I'd like to use something like the truncate feature of webhelpers on html data that's being pulled in from an ATOM feed. If I just use a simple truncate, it might leave some html tags opened (like a div without a /div) which is Bad.
>>
>> I figured that this was a common-enough task that I'd ask some experts before trying to roll my own solution. It seems like the kind of thing that might be hidden within the standard library somewhere, below my nose, but outside of my ability to discover.
>>
>> I've found this:
>> http://code.djangoproject.com/browser/django/trunk/django/utils/text.py
>>
>> Looks to be about the right thing, but I'd rather not be dependent on all of Django to do this.
>>
>> Perhaps some ElementTree or LXML wizard knows a quick hack?
>>
>> Thanks!
>
> I've had excellent luck stripping HTML with the following:
> http://www.aminus.net/browser/cleanhtml.py
>
> I use it to strip out all the html leaving a nice plain string. It does the best job of any solutions I've seen.
>
> TJ

I think he just wants to make sure the HTML is well-formed, not strip the tags completely.

However, strip_tags() is something WebHelpers should provide. I've noticed the lack a couple times. However, I'm not sure of the best implementation.

- sgmllib (used in cleanhtml.py): not in Python 3. Can cleanhtml.py be ported to HTMLParser?
- lxml: hard to install on Mac and Windows due to C dependencies.
- BeautifulSoup: has the best ability to parse real-world (i.e., misformed) HTML. However, it's a largish library so I'm not sure any helper should depend on it.
- Simplicity vs speed. Would routines that depend only on the Python standard library be fast enough?

As for Matt's case of truncating HTML without making it misformed, would this be widely enough used to justify making a webhelper for it?

--
Mike Orr [EMAIL PROTECTED]
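On the HTMLParser question: a standard-library-only strip_tags() could be as small as the sketch below. This is not a port of cleanhtml.py, just an illustration of the approach; the class name _TextCollector is made up, and the import path is the Python 3 one (html.parser). Skipping the contents of script and style is an added assumption about what "stripping" should mean.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: keep text nodes, ignore every tag."""

    SKIP = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with all markup removed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))
```

The result is plain text with entities decoded, so it has to be escaped again before it goes back into a page.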
Re: Truncating an html string safely
Mike Orr wrote:
> On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
>> On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:
>>> I'd like to use something like the truncate feature of webhelpers on html data that's being pulled in from an ATOM feed. If I just use a simple truncate, it might leave some html tags opened (like a div without a /div) which is Bad.
>>>
>>> I figured that this was a common-enough task that I'd ask some experts before trying to roll my own solution. It seems like the kind of thing that might be hidden within the standard library somewhere, below my nose, but outside of my ability to discover.
>>>
>>> I've found this:
>>> http://code.djangoproject.com/browser/django/trunk/django/utils/text.py
>>>
>>> Looks to be about the right thing, but I'd rather not be dependent on all of Django to do this.
>>>
>>> Perhaps some ElementTree or LXML wizard knows a quick hack?
>>>
>>> Thanks!
>>
>> I've had excellent luck stripping HTML with the following:
>> http://www.aminus.net/browser/cleanhtml.py
>>
>> I use it to strip out all the html leaving a nice plain string. It does the best job of any solutions I've seen.
>>
>> TJ
>
> I think he just wants to make sure the HTML is well-formed, not strip the tags completely.
>
> However, strip_tags() is something WebHelpers should provide. I've noticed the lack a couple times. However, I'm not sure of the best implementation.

strip_tags should be easy enough to implement with some regexes -- you just have to remove <.*?>, then resolve any entities.

This code does some fairly simplistic rendering of HTML (but better than what strip_tags would likely do), and might have a better home in WebHelpers:
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py

--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
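For comparison, the regex route Ian describes (remove anything that looks like a tag, then resolve entities) fits in a few lines. This is a sketch rather than his code: it assumes the pattern meant is <.*?>, and the names TAG_RE and strip_tags_regex are made up here.

```python
import html
import re

# Non-greedy match from "<" to the nearest ">"; DOTALL lets a tag span lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_tags_regex(markup):
    """Drop tag-like spans, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", markup))


print(strip_tags_regex('<a href="x">Fish &amp; chips</a>'))
```

The pattern knows nothing about HTML structure: a literal "<" followed later by ">" in running text is removed as if it were a tag, and the contents of script or style elements are left in place. That is the trade-off against a parser-based version.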
Re: Truncating an html string safely
On Thu, Jun 5, 2008 at 1:01 PM, Ian Bicking [EMAIL PROTECTED] wrote:
> Mike Orr wrote:
>> On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
>>> On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:
>>>> I'd like to use something like the truncate feature of webhelpers on html data that's being pulled in from an ATOM feed. If I just use a simple truncate, it might leave some html tags opened (like a div without a /div) which is Bad.
>>>>
>>>> I figured that this was a common-enough task that I'd ask some experts before trying to roll my own solution. It seems like the kind of thing that might be hidden within the standard library somewhere, below my nose, but outside of my ability to discover.
>>>>
>>>> I've found this:
>>>> http://code.djangoproject.com/browser/django/trunk/django/utils/text.py
>>>>
>>>> Looks to be about the right thing, but I'd rather not be dependent on all of Django to do this.
>>>>
>>>> Perhaps some ElementTree or LXML wizard knows a quick hack?
>>>>
>>>> Thanks!
>>>
>>> I've had excellent luck stripping HTML with the following:
>>> http://www.aminus.net/browser/cleanhtml.py
>>>
>>> I use it to strip out all the html leaving a nice plain string. It does the best job of any solutions I've seen.
>>>
>>> TJ
>>
>> I think he just wants to make sure the HTML is well-formed, not strip the tags completely.
>>
>> However, strip_tags() is something WebHelpers should provide. I've noticed the lack a couple times. However, I'm not sure of the best implementation.
>
> strip_tags should be easy enough to implement with some regexes -- you just have to remove <.*?>, then resolve any entities.
>
> This code does some fairly simplistic rendering of HTML (but better than what strip_tags would likely do), and might have a better home in WebHelpers:
> http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py

Put in the WebHelpers unfinished directory and opened ticket #458 to integrate it.

--
Mike Orr [EMAIL PROTECTED]
Re: Truncating an html string safely
On Thu, Jun 5, 2008 at 5:03 PM, Mike Orr [EMAIL PROTECTED] wrote:
> On Thu, Jun 5, 2008 at 1:01 PM, Ian Bicking [EMAIL PROTECTED] wrote:
>> Mike Orr wrote:
>>> On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
>>>> On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:
>>>>> I'd like to use something like the truncate feature of webhelpers on html data that's being pulled in from an ATOM feed. If I just use a simple truncate, it might leave some html tags opened (like a div without a /div) which is Bad.
>>>>>
>>>>> I figured that this was a common-enough task that I'd ask some experts before trying to roll my own solution. It seems like the kind of thing that might be hidden within the standard library somewhere, below my nose, but outside of my ability to discover.
>>>>>
>>>>> I've found this:
>>>>> http://code.djangoproject.com/browser/django/trunk/django/utils/text.py
>>>>>
>>>>> Looks to be about the right thing, but I'd rather not be dependent on all of Django to do this.
>>>>>
>>>>> Perhaps some ElementTree or LXML wizard knows a quick hack?
>>>>>
>>>>> Thanks!
>>>>
>>>> I've had excellent luck stripping HTML with the following:
>>>> http://www.aminus.net/browser/cleanhtml.py
>>>>
>>>> I use it to strip out all the html leaving a nice plain string. It does the best job of any solutions I've seen.
>>>>
>>>> TJ
>>>
>>> I think he just wants to make sure the HTML is well-formed, not strip the tags completely.
>>>
>>> However, strip_tags() is something WebHelpers should provide. I've noticed the lack a couple times. However, I'm not sure of the best implementation.
>>
>> strip_tags should be easy enough to implement with some regexes -- you just have to remove <.*?>, then resolve any entities.
>>
>> This code does some fairly simplistic rendering of HTML (but better than what strip_tags would likely do), and might have a better home in WebHelpers:
>> http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py
>
> Put in the WebHelpers unfinished directory and opened ticket #458 to integrate it.

I have some boilerplate multi-threaded examples of using BeautifulSoup here:
http://www-128.ibm.com/developerworks/aix/library/au-threadingpython/