I am passing the parser the results I get from Text::Context::EitherSide rather than the entire file. Some of the files are 60 MB or larger, and my machine freezes and crashes if I pass in a whole one. I'll re-read the spec and see what I come up with.

-- Craig

Thomas, Mark - BLS CTR wrote:
I'm using HTML::TokeParser to remove HTML. This functions very well when tags are contained on one line.


I used to use TokeParser although now I prefer parsing HTML with XPath.
TokeParser is a stream parser--there isn't any problem with newlines.
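
To illustrate the point about newlines, here is a minimal sketch that feeds HTML::TokeParser a string whose <font> tag is split across two lines and keeps only the text tokens. The helper name strip_html and the sample markup are made up for illustration; they are not from the original message.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# strip_html is a hypothetical helper name, not part of HTML::TokeParser.
sub strip_html {
    my ($html) = @_;

    # Passing a scalar reference makes the parser treat the string
    # itself as the document (a plain string would be a file name).
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my $text = '';
    while (my $token = $parser->get_token) {
        # Text tokens have the form ["T", $text, $is_data];
        # start tags, end tags, comments etc. are skipped.
        next unless $token->[0] eq 'T';
        $text .= $token->[1];
    }
    return $text;
}

# The <font> tag breaks after the first attribute, so a line-by-line
# reader would see a line starting with "size=2>".
my $sample = "<p>Report <font face=\"Arial\"\nsize=2>totals</font> for 2005</p>\n";
print strip_html($sample);
```

Because the parser works on the token stream rather than on lines, the line break inside the tag never shows up in the output. Note that text tokens are returned as-is, so entities such as &amp; are not decoded in this sketch.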


What happens when you're reading a file line by line, and the HTML tag is split between lines?


Don't do that.


The HTML I'm dealing with is sometimes broken up. For instance, I'll get lines beginning with "size=2>" which is the end of a tag that began on the previous line.

Any suggestions or recommendations on cleaning up -- as in removing -- this kind of "broken" HTML?


It's not broken. The spec allows whitespace including newlines in the tags.
The proper way to parse HTML is with a specialized parser.
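
For very large files, one possible approach is to hand the parser an open filehandle instead of the file contents, so that it reads the input in small chunks as tokens are requested. The sketch below assumes that setup; html_to_text is a hypothetical helper name, and the in-memory filehandles only stand in for real files so the example runs on its own.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# html_to_text is a hypothetical helper: it copies only the text
# tokens from the filehandle $in to the filehandle $out.
sub html_to_text {
    my ($in, $out) = @_;

    # Given a filehandle, the parser pulls input from it piece by
    # piece, so the whole document is never held in one string.
    my $parser = HTML::TokeParser->new($in)
        or die "Cannot create parser: $!";

    while (my $token = $parser->get_token) {
        print {$out} $token->[1] if $token->[0] eq 'T';
    }
    return;
}

# Stand-in input; in real use, open the large file by path instead,
# e.g. open my $in, '<', $path (where $path is your own file name).
my $html = "<table><tr><td\nalign=left>first</td>\n<td>second</td></tr></table>\n";
open my $in, '<', \$html or die "Cannot open input: $!";

my $result = '';
open my $out, '>', \$result or die "Cannot open output: $!";

html_to_text($in, $out);
close $out;
print $result;
```

The tag boundaries do not need to line up with the chunks the parser reads, which is why a tag split across lines (or across reads) is still recognized as a single tag.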


_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com