Re: HTML::TokeParser and tags split between lines

Craig Cardimon Tue, 06 Sep 2005 10:44:09 -0700

I am passing to the parser the results I get from using Text::Context::EitherSide. I can't pass in the entire file. Some of the files are 60 MB and larger. My machine freezes and crashes if I do that. I'll re-read the spec and see what I come up with.


-- Craig


Thomas, Mark - BLS CTR wrote:

> > I'm using HTML::TokeParser to remove HTML. This functions
> > very well when tags are contained on one line.
>
> I used to use TokeParser, although now I prefer parsing HTML with XPath.
> TokeParser is a stream parser--there aren't any problems with newlines.
>
> > What happens when you're reading a file line by line, and the
> > HTML tag is split between lines?
>
> Don't do that.
>
> > The HTML I'm dealing with is sometimes broken up. For instance, I'll
> > get lines beginning with "size=2>", which is the end of a tag that
> > began on the previous line.
> > Any suggestions or recommendations on cleaning up -- as in removing --
> > this kind of "broken" HTML?
>
> It's not broken. The spec allows whitespace, including newlines, in the tags.
> The proper way to parse HTML is with a specialized parser.
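
As an illustration of the stream-parser point, here is a minimal sketch that hands HTML::TokeParser a whole source (a filename, an open filehandle, or a reference to a string) instead of feeding it one line at a time. The helper name strip_html and the sample markup are made up for this example and do not come from the thread. Because the parser reads the source itself and each text chunk is written out as soon as it is seen, a tag that wraps onto the next line should come through as a single tag, and the stripped text is never accumulated in memory.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: write the text content of an HTML source to $out,
# dropping all tags. $source may be a filename, an open filehandle, or a
# reference to a string holding the document.
sub strip_html {
    my ($source, $out) = @_;

    my $parser = HTML::TokeParser->new($source)
        or die "Cannot open HTML source: $!";

    while (my $token = $parser->get_token) {
        # Text tokens look like ["T", $text, $is_data]; skip tags,
        # comments, declarations and processing instructions.
        next unless $token->[0] eq 'T';
        print {$out} $token->[1];
    }
    return;
}

# Sample input with a tag split across two lines, like the "size=2>" case.
my $html = qq{<p>Hello <font face="Arial"\nsize=2>world</font></p>\n};
strip_html(\$html, \*STDOUT);
```

For a large file, the same call could be given the filename or an open filehandle as the first argument and an output filehandle as the second. Entities such as &amp; are left undecoded in this sketch.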






_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
