I am passing the parser the results I get from
Text::Context::EitherSide. I can't pass in the entire file: some of the
files are 60 MB or larger, and my machine freezes and crashes if I do that.
I'll re-read the spec and see what I come up with.
-- Craig
Thomas, Mark - BLS CTR wrote:
I'm using HTML::TokeParser to remove HTML. This functions
very well when tags are contained on one line.
I used to use TokeParser, although now I prefer parsing HTML with XPath.
TokeParser is a stream parser, so newlines aren't a problem.
What happens when you're reading a file line by line, and the
HTML tag is split between lines?
Don't do that.
The HTML I'm dealing with is sometimes broken
up. For instance, I'll get lines beginning with "size=2>", which is the
end of a tag that began on the previous line.
Any suggestions or recommendation on cleaning up -- as in removing --
this kind of "broken" HTML?
It's not broken. The spec allows whitespace, including newlines, inside tags.
The proper way to parse HTML is with a specialized parser.
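For what it's worth, here is a minimal sketch of that approach with HTML::TokeParser. It assumes all you want is the text with the tags removed. The name strip_html is made up for this example and is not part of the module. As far as I know, when you give new() a file name the parser reads the file in small chunks rather than slurping it, so a 60 MB file should not need to be loaded into memory or pre-split into lines.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# strip_html is a hypothetical helper name, not part of HTML::TokeParser.
# $source is a file name, an open filehandle, or a reference to a string,
# all of which HTML::TokeParser->new accepts. Text is written to $out_fh.
sub strip_html {
    my ($source, $out_fh) = @_;
    my $p = HTML::TokeParser->new($source)
        or die "Can't open HTML source: $!";
    while (my $token = $p->get_token) {
        # Text tokens look like ["T", $text, $is_data]; skip everything else.
        next unless $token->[0] eq 'T';
        print {$out_fh} $token->[1];
    }
    return;
}

# Usage with a tag split across two lines, as in the "size=2>" case.
my $html = qq{<font face="Arial"\nsize=2>Hello</font> <b>world</b>\n};
open my $out, '>', \my $text or die "Can't open in-memory handle: $!";
strip_html(\$html, $out);
close $out;
print $text;
```

The split font tag is dropped like any other tag, because the parser works on the token stream and not on lines. Note that this sketch does not decode entities such as &amp;amp; and it passes the contents of script and style elements through as text, so you would need to handle those separately if they matter for your files.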
_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs