Re: Truncating an html string safely

2008-06-12 Thread Shannon -jj Behrens

On Sat, Jun 7, 2008 at 7:24 AM, Matt Feifarek [EMAIL PROTECTED] wrote:
 Oops; replied from the wrong address.

 -- Forwarded message --

 On Thu, Jun 5, 2008 at 2:36 PM, Ian Bicking [EMAIL PROTECTED] wrote:

 Well... it's hard to truncate exactly, as there's all that annoying
 nesting stuff.  An untested attempt with lxml:

 Exactly. Thanks for the lead.

 I'm not sure I'm up to the challenge, but if I do get it working, I'll get
 it back to you, in case it's good enough to be added to lxml (or whatever).

 Mike:
 Seems like if we have the truncate function in webhelpers, a truncate that
 handles html would be wise... since we're, err, making html, usually, with
 Pylons.

 Since the Django code doesn't seem to depend on anything (but some Django
 cruft, which seems to be frosting really) MAYBE it would be better to start
 with.

 But I'll poke around a bit today.

It would be fun to write a SAX handler that permits all tags, and
counts all characters.  It would stop permitting additional characters
once it reached a certain limit.
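
A rough sketch of that idea, for what it's worth. It uses the standard library's html.parser instead of a true SAX handler, since xml.sax would reject HTML that isn't well-formed XML. The names TruncatingParser and truncate_html are made up for illustration, and comments and doctypes are simply dropped:

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Pass tags through, but stop once `limit` text characters are emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []   # stack of elements still waiting for a close tag
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            ' %s="%s"' % (k, escape(v, quote=True)) if v is not None else " " + k
            for k, v in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) > self.remaining:
            data = data[:self.remaining]
        self.remaining -= len(data)
        self.out.append(escape(data, quote=False))
        if self.remaining <= 0:
            self.done = True

    def result(self):
        # Close whatever is still open, innermost first.
        closing = "".join("</%s>" % t for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    return parser.result()


print(truncate_html("<div><p>Hello <b>world</b> and more</p></div>", 8))
# -> <div><p>Hello <b>wo</b></p></div>
```

Only text characters count toward the limit; tags and attributes are free. Cutting at a word boundary or appending an ellipsis would need extra handling in handle_data.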

-jj

-- 
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
pylons-discuss group.
To post to this group, send email to pylons-discuss@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/pylons-discuss?hl=en
-~--~~~~--~~--~--~---



Re: Truncating an html string safely

2008-06-12 Thread rcs_comp



On Jun 12, 5:13 am, Mike Orr [EMAIL PROTECTED] wrote:

 Although again, we have two issues.  One is HTML-to-text (essentially
 lynx-as-a-function).  The other is truncating an HTML string while
 keeping it well-formed (which means not stopping in the middle of a
 tag and closing any open tags).


Here is another sanitizer (I think from something having to do with
Zope):

http://www.koders.com/python/fidFB51F4D2D89CC1397608213E09F11404D9B21059.aspx



Re: Truncating an html string safely

2008-06-12 Thread rcs_comp



On Jun 12, 5:13 am, Mike Orr [EMAIL PROTECTED] wrote:
 Although again, we have two issues.  One is HTML-to-text (essentially
 lynx-as-a-function).  The other is truncating an HTML string while
 keeping it well-formed (which means not stopping in the middle of a
 tag and closing any open tags).

Actually, I think we may have four issues...?

1) truncate HTML and end up with well-formed HTML.
2) strip all HTML tags (without an interest in text formatting)
3) html2text (trying to keep text formatting with p, block, etc.)
4) sanitizing HTML (not directly discussed here, but a good
implementation of this will be helpful, increase security, and should
be able to be extended trivially to provide #2, stripping all HTML
tags).
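
To illustrate how #4 can collapse into #2, here is a minimal sketch of an allowlist filter built on the standard library's html.parser. The names AllowlistFilter and sanitize are hypothetical. It drops every attribute, skips the contents of script and style, and does not rebalance tags, so it should not be treated as a vetted security tool:

```python
from html import escape
from html.parser import HTMLParser

# Allowed tags listed here are emitted without a closing tag.
VOID_TAGS = {"br", "hr", "img"}


class AllowlistFilter(HTMLParser):
    """Re-emit only allowed tags (without attributes) plus all text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = frozenset(allowed)
        self.out = []
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in self.allowed and not self.skip_depth:
            # Attributes are dropped wholesale in this sketch.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif (tag in self.allowed and tag not in VOID_TAGS
              and not self.skip_depth):
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html, allowed=()):
    parser = AllowlistFilter(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


dirty = '<p onclick="x()">Hi <b>there</b><script>alert(1)</script></p>'
print(sanitize(dirty, allowed={"p", "b"}))   # -> <p>Hi <b>there</b></p>
print(sanitize(dirty))                       # -> Hi there
```

With an empty allowlist the same code strips all tags, which is the "extended trivially" case. Note that the output text stays entity-escaped, so it is still safe to embed in HTML.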



Re: Truncating an html string safely

2008-06-07 Thread Matt Feifarek
Oops; replied from the wrong address.

-- Forwarded message --

On Thu, Jun 5, 2008 at 2:36 PM, Ian Bicking [EMAIL PROTECTED] wrote:


 Well... it's hard to truncate exactly, as there's all that annoying
 nesting stuff.  An untested attempt with lxml:


Exactly. Thanks for the lead.

I'm not sure I'm up to the challenge, but if I do get it working, I'll get
it back to you, in case it's good enough to be added to lxml (or whatever).

Mike:
Seems like if we have the truncate function in webhelpers, a truncate that
handles html would be wise... since we're, err, making html, usually, with
Pylons.

Since the Django code doesn't seem to depend on anything (but some Django
cruft, which seems to be frosting really) MAYBE it would be better to start
with.

But I'll poke around a bit today.




Re: Truncating an html string safely

2008-06-05 Thread TJ Ninneman
On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:

 I'd like to use something like the truncate feature of webhelpers  
 on html data that's being pulled in from an ATOM feed.

 If I just use a simple truncate, it might leave some html tags  
 opened (like a div without a /div) which is Bad.

 I figured that this was a common-enough task that I'd ask some  
 experts before trying to roll my own solution. It seems like the  
 kind of thing that might be hidden within the standard library  
 somewhere, below my nose, but outside of my ability to discover.

 I've found this:
 http://code.djangoproject.com/browser/django/trunk/django/utils/text.py

 Looks to be about the right thing, but I'd rather not be dependent  
 on all of Django to do this.

 Perhaps some ElementTree or LXML wizard knows a quick hack?

 Thanks!

 

I've had excellent luck stripping HTML with the following:

http://www.aminus.net/browser/cleanhtml.py

I use it to strip out all the html leaving a nice plain string.  It  
does the best job of any solutions I've seen.

TJ





Re: Truncating an html string safely

2008-06-05 Thread Mike Orr

On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
 On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:

 I'd like to use something like the truncate feature of webhelpers on html
 data that's being pulled in from an ATOM feed.

 If I just use a simple truncate, it might leave some html tags opened (like
 a div without a /div) which is Bad.

 I figured that this was a common-enough task that I'd ask some experts
 before trying to roll my own solution. It seems like the kind of thing that
 might be hidden within the standard library somewhere, below my nose, but
 outside of my ability to discover.

 I've found this:
 http://code.djangoproject.com/browser/django/trunk/django/utils/text.py

 Looks to be about the right thing, but I'd rather not be dependent on all of
 Django to do this.

 Perhaps some ElementTree or LXML wizard knows a quick hack?

 Thanks!




 I've had excellent luck stripping HTML with the following:
 http://www.aminus.net/browser/cleanhtml.py
 I use it to strip out all the html leaving a nice plain string.  It does the
 best job of any solutions I've seen.

 TJ

I think he just wants to make sure the HTML is well-formed, not strip
the tags completely.  However, strip_tags() is something WebHelpers
should provide.  I've noticed the lack a couple times.  However, I'm
not sure of the best implementation.

- sgmllib: (used in cleanhtml.py): not in Python 3.  Can
cleanhtml.py be ported to HTMLParser?

- lxml: hard to install on Mac and Windows due to C dependencies.

- BeautifulSoup: has the best ability to parse real-world (i.e.,
malformed) HTML.  However, it's a largish library so I'm not sure any
helper should depend on it.

- Simplicity vs speed.  Would routines that depend only on the
Python standard library be fast enough?

As for Matt's case of truncating HTML without making it malformed,
would this be widely enough used to justify making a webhelper for it?

-- 
Mike Orr [EMAIL PROTECTED]




Re: Truncating an html string safely

2008-06-05 Thread Ian Bicking

Mike Orr wrote:
 On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
 On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:

 I'd like to use something like the truncate feature of webhelpers on html
 data that's being pulled in from an ATOM feed.

 If I just use a simple truncate, it might leave some html tags opened (like
 a div without a /div) which is Bad.

 I figured that this was a common-enough task that I'd ask some experts
 before trying to roll my own solution. It seems like the kind of thing that
 might be hidden within the standard library somewhere, below my nose, but
 outside of my ability to discover.

 I've found this:
 http://code.djangoproject.com/browser/django/trunk/django/utils/text.py

 Looks to be about the right thing, but I'd rather not be dependent on all of
 Django to do this.

 Perhaps some ElementTree or LXML wizard knows a quick hack?

 Thanks!




 I've had excellent luck stripping HTML with the following:
 http://www.aminus.net/browser/cleanhtml.py
 I use it to strip out all the html leaving a nice plain string.  It does the
 best job of any solutions I've seen.

 TJ
 
 I think he just wants to make sure the HTML is well-formed, not strip
 the tags completely.  However, strip_tags() is something WebHelpers
 should provide.  I've noticed the lack a couple times.  However, I'm
 not sure of the best implementation.

strip_tags should be easy enough to implement with some regexes -- you 
just have to remove <.*?>, then resolve any entities.
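
A minimal sketch of the regex approach, assuming the pattern meant is a non-greedy match on anything between angle brackets (`<.*?>`). The function name strip_tags follows the thread; the rest is illustrative:

```python
import re
from html import unescape

# Non-greedy match from "<" to the next ">"; DOTALL lets a tag span lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(html):
    """Remove anything that looks like a tag, then resolve entities."""
    return unescape(TAG_RE.sub("", html))


print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))   # -> Fish & chips
```

This is deliberately naive: a ">" inside an attribute value ends the match early, and the contents of script and style elements survive as text. Tags are removed before entities are resolved so that an escaped "&lt;b&gt;" in the text is not mistaken for markup.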

This code does some fairly simplistic rendering of HTML (but better than 
what strip_tags would likely do), and might have a better home in 
WebHelpers:
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py
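
For comparison, here is a very small sketch of what "simplistic rendering" could mean: keep the text, but turn block-level tags into paragraph breaks and br into a newline. This is not the ZPTKit code; TextRenderer, html_to_text and the BLOCK_TAGS set are assumptions made for illustration:

```python
import re
from html.parser import HTMLParser

# Tags that start and end a paragraph in the plain-text output.
BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextRenderer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse runs of whitespace inside text, as a browser would.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    text = "".join(renderer.parts)
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
    return text.strip()


print(html_to_text("<p>First  para</p><p>Second<br>line</p>"))
```

Whitespace inside pre is collapsed like everywhere else, and script or style contents would come through as text, so a real renderer needs more cases than this.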

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org




Re: Truncating an html string safely

2008-06-05 Thread Mike Orr

On Thu, Jun 5, 2008 at 1:01 PM, Ian Bicking [EMAIL PROTECTED] wrote:

 Mike Orr wrote:
 On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
 On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:

 I'd like to use something like the truncate feature of webhelpers on html
 data that's being pulled in from an ATOM feed.

 If I just use a simple truncate, it might leave some html tags opened (like
 a div without a /div) which is Bad.

 I figured that this was a common-enough task that I'd ask some experts
 before trying to roll my own solution. It seems like the kind of thing that
 might be hidden within the standard library somewhere, below my nose, but
 outside of my ability to discover.

 I've found this:
 http://code.djangoproject.com/browser/django/trunk/django/utils/text.py

 Looks to be about the right thing, but I'd rather not be dependent on all of
 Django to do this.

 Perhaps some ElementTree or LXML wizard knows a quick hack?

 Thanks!




 I've had excellent luck stripping HTML with the following:
 http://www.aminus.net/browser/cleanhtml.py
 I use it to strip out all the html leaving a nice plain string.  It does the
 best job of any solutions I've seen.

 TJ

 I think he just wants to make sure the HTML is well-formed, not strip
 the tags completely.  However, strip_tags() is something WebHelpers
 should provide.  I've noticed the lack a couple times.  However, I'm
 not sure of the best implementation.

 strip_tags should be easy enough to implement with some regexes -- you
 just have to remove <.*?>, then resolve any entities.

 This code does some fairly simplistic rendering of HTML (but better than
 what strip_tags would likely do), and might have a better home in
 WebHelpers:
 http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py

Put in the WebHelpers unfinished directory and opened ticket #458 to
integrate it.

-- 
Mike Orr [EMAIL PROTECTED]




Re: Truncating an html string safely

2008-06-05 Thread Noah Gift
On Thu, Jun 5, 2008 at 5:03 PM, Mike Orr [EMAIL PROTECTED] wrote:


 On Thu, Jun 5, 2008 at 1:01 PM, Ian Bicking [EMAIL PROTECTED] wrote:
 
  Mike Orr wrote:
  On Thu, Jun 5, 2008 at 11:56 AM, TJ Ninneman [EMAIL PROTECTED] wrote:
  On Jun 5, 2008, at 12:59 PM, Matt Feifarek wrote:
 
  I'd like to use something like the truncate feature of webhelpers on html
  data that's being pulled in from an ATOM feed.
 
  If I just use a simple truncate, it might leave some html tags opened (like
  a div without a /div) which is Bad.
 
  I figured that this was a common-enough task that I'd ask some experts
  before trying to roll my own solution. It seems like the kind of thing that
  might be hidden within the standard library somewhere, below my nose, but
  outside of my ability to discover.
 
  I've found this:
  http://code.djangoproject.com/browser/django/trunk/django/utils/text.py
 
  Looks to be about the right thing, but I'd rather not be dependent on all of
  Django to do this.
 
  Perhaps some ElementTree or LXML wizard knows a quick hack?
 
  Thanks!
 
 
 
 
  I've had excellent luck stripping HTML with the following:
  http://www.aminus.net/browser/cleanhtml.py
  I use it to strip out all the html leaving a nice plain string.  It does the
  best job of any solutions I've seen.
 
  TJ
 
  I think he just wants to make sure the HTML is well-formed, not strip
  the tags completely.  However, strip_tags() is something WebHelpers
  should provide.  I've noticed the lack a couple times.  However, I'm
  not sure of the best implementation.
 
  strip_tags should be easy enough to implement with some regexes -- you
  just have to remove <.*?>, then resolve any entities.
 
  This code does some fairly simplistic rendering of HTML (but better than
  what strip_tags would likely do), and might have a better home in
  WebHelpers:
  http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py

 Put in the WebHelpers unfinished directory and opened ticket #458 to
 integrate it.


I have some boilerplate multithreaded examples that use BeautifulSoup
here:

http://www-128.ibm.com/developerworks/aix/library/au-threadingpython/
