Re: Scan data for XML invalid characters and parse articles

Adam Turoff Wed, 13 Feb 2002 08:34:20 -0800

On Wed, Feb 13, 2002 at 08:40:14AM -0800, John wrote:
> I have a scalar variable containing HTML that needs to be converted 
> to XML.  It's not the best HTML so it has invalid characters (like 
> smart quotes, 1/2 character, etc.).  I need to determine if these 
> characters exist in the data and throw an error if they do.  What 
> is the best way to do this?  I can't use an XML parser because it's 
> not really XML.
> 
> Also, if I have a block of text like this:
> 
> <!-- begin article1 title -->title1<!-- end article1 title -->
> <!-- begin article1 body -->body1<!-- end article1 body -->
> ...
> <!-- begin articleN title -->titleN<!-- end articleN title -->
> <!-- begin articleN body -->bodyN<!-- end articleN body -->
> 
> Where the ... means there could be some number of articles (less 
> than 5), can anyone think of a relatively simple regex (I mean I 
> don't want to have article1, article2, etc. hard-coded in the regex) 
> that will extract the titles and bodies?


Perhaps you're going about it the wrong way.  Why not make three
simple passes on the input file instead of throwing a lot of 
complexity into the regex?

        - scrub "bad characters" (curly quotes), translating as necessary
          (a simple tr/// should suffice here)
        - use HTML Tidy to convert to XML
        - use your favorite XML parser on the result.
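For the first pass, here is a minimal sketch of what I have in mind.
The sub name scrub_html is made up for illustration, and it assumes
the scalar already holds decoded characters (if your data is raw
Windows-1252 bytes, the code points to translate will differ).  It
maps the usual curly quotes with tr///, handles the 1/2 character
with s/// since that is a one-to-many replacement, and then dies if
anything outside plain ASCII is still there:

```perl
use strict;
use warnings;

# Hypothetical helper: map common "smart" punctuation to ASCII, then
# die if anything outside tab, LF, CR and printable ASCII remains.
# Assumes $html holds decoded characters, not raw bytes.
sub scrub_html {
    my ($html) = @_;

    # One-to-one replacements: curly single and double quotes.
    $html =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;

    # One-to-many replacements need s///, e.g. the 1/2 character.
    $html =~ s/\x{00BD}/1\/2/g;

    # Report the first leftover character and where it was found.
    if ($html =~ /([^\x09\x0A\x0D\x20-\x7E])/) {
        die sprintf("unexpected character U+%04X at offset %d\n",
                    ord($1), $-[0]);
    }
    return $html;
}

my $clean = scrub_html("\x{201C}Hi\x{201D} \x{00BD} cup");
print "$clean\n";
```

If you only want to detect bad characters and not translate them,
drop the tr/// and s/// lines and keep the final check.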

HTML Tidy can be found at w3.org (check Google for the exact URL).
It will tidy up HTML pages, and it can also convert HTML into a
well-formed XML document.

Alternatively, you may wish to skip this process entirely and use the
HTML Parser in XML::LibXML directly.
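
A rough sketch of that route, assuming a reasonably recent
XML::LibXML is installed (the sample markup and variable names are
made up).  The load_html constructor uses libxml2's forgiving HTML
parser, and recover => 2 asks it to repair what it can without
printing complaints:

```perl
use strict;
use warnings;
use XML::LibXML;

# Sloppy sample input: unclosed <p>, bare <br>.
my $html = '<html><head><title>Example</title></head>'
         . '<body><p>Unclosed paragraph<br>more text</body></html>';

# Parse with the HTML parser; recover => 2 means "recover silently".
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Once parsed, the usual DOM and XPath calls are available.
my $title = $doc->findvalue('//title');
print "title: $title\n";

# toString serializes the tree; toStringHTML is also available if
# HTML output is wanted instead.
print $doc->toString;
```

You would still want to scrub or reject the odd characters first,
since the parser will not know what encoding they were meant to be in.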

Z.

