Mike Orr wrote:
> On Thu, Jun 12, 2008 at 7:55 AM, rcs_comp <[EMAIL PROTECTED]> wrote:
>> On Jun 12, 5:13 am, "Mike Orr" <[EMAIL PROTECTED]> wrote:
>>> Although again, we have two issues.  One is HTML-to-text (essentially
>>> lynx-as-a-function).  The other is truncating an HTML string while
>>> keeping it well-formed (which means not stopping in the middle of a
>>> tag and closing any open tags).
>> Actually, I think we may have four issues...?
>>
>> 1) truncate HTML and end up with well-formed HTML.
> 
> I agree with you; I'm not convinced this is a broad enough need to
> warrant a webhelper.  But some significant use cases would help
> convince me.
> 
>> 2) strip all HTML tags (without an interest in text formatting)
>> 3) html2text (trying to keep text formatting with p, block, etc.)
> 
> Ian's code handles p and div, and treats block as p.  Other tags are
> stripped and ignored.  We can extend it if we want more sophisticated
> formatting.  Actually, indented blocks would be useful.  And
> optionally displaying the hrefs.  (Lynx does this with footnotes.)

I think blockquote might be handled, and anchors do show their links. 
Possibly also lists?

The code was originally written for creating text alternatives to HTML 
email.
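
For illustration only, here is a minimal sketch of the html2text idea using
just the stdlib. This is not the code being discussed in this thread; the
names TextExtractor, BLOCK_TAGS and html_to_text are made up for the example,
and it assumes that p, div and blockquote are the only block tags and that
link targets are listed as numbered footnotes, Lynx-style.

```python
import re
from html.parser import HTMLParser

# Assumption for this sketch: only these tags start a new paragraph.
BLOCK_TAGS = {"p", "div", "blockquote"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, paragraph breaks and link targets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []    # text fragments and "\n\n" paragraph markers
        self.links = []    # hrefs in order of appearance
        self._href = None  # href of the anchor currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a":
            self._href = dict(attrs).get("href")
        # every other tag is silently dropped; its text is kept

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a" and self._href:
            self.links.append(self._href)
            self.parts.append("[%d]" % len(self.links))
            self._href = None

    def handle_data(self, data):
        # Collapse source whitespace so only block tags create breaks.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = [" ".join(p.split()) for p in "".join(parser.parts).split("\n\n")]
    body = "\n\n".join(p for p in paragraphs if p)
    notes = "\n".join("[%d] %s" % (i, url) for i, url in enumerate(parser.links, 1))
    return body + ("\n\n" + notes if notes else "")
```

Lists, indentation for blockquote, and line wrapping are left out here; each
would be another branch in the start/end tag handlers.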

>> 4) sanitizing HTML (not directly discussed here, but a good
>> implementation of this will be helpful, increase security, and should
>> be able to be extended trivially to provide #2, stripping all HTML
>> tags).
> 
> What exactly do you mean by sanitizing?  Stripping all except a few
> formatting tags?  This would be good for WebHelpers if somebody can
> provide an implementation.  One not depending on non-stdlib packages.

I think Jon Rosebaugh (aka Chairos) ported lxml.html.clean to 
BeautifulSoup.  You couldn't do it without some kind of HTML parser, but 
BS is an easy install (or even include it, it's just one file).

feedparser also includes a cleaner, but IMHO it's a bit more crude.
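
To make "stripping all except a few formatting tags" concrete, here is a rough
stdlib-only sketch. It is not the lxml.html.clean port or feedparser's cleaner,
and it is not a vetted security tool; Sanitizer, ALLOWED_TAGS, DROP_CONTENT and
sanitize are hypothetical names, the whitelist is an arbitrary choice, all
attributes are discarded, and unbalanced input tags are passed through
unbalanced.

```python
from html import escape
from html.parser import HTMLParser

# Assumptions for this sketch: a small whitelist, and two tags whose
# contents are thrown away entirely rather than kept as text.
ALLOWED_TAGS = frozenset({"b", "i", "em", "strong", "p", "br"})
DROP_CONTENT = frozenset({"script", "style"})


class Sanitizer(HTMLParser):
    """Hypothetical whitelist filter: keeps allowed tags, drops attributes."""

    def __init__(self, allowed=ALLOWED_TAGS):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []
        self._skip = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
        elif tag in self.allowed and not self._skip:
            self.out.append("<%s>" % tag)  # attributes are never copied

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in self.allowed and tag != "br" and not self._skip:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data))  # re-escape decoded text


def sanitize(html, allowed=ALLOWED_TAGS):
    parser = Sanitizer(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling sanitize(html, allowed=frozenset()) gives case #2 (strip every tag)
from the same code, which is the "extended trivially" point above.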

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
