I need some help from the text processing gurus/script experts here. I
have some XML that comes from an Oracle (relevant to mention that?) DB
that contains raw HTML (pre-XHTML) content. This makes the resulting XML
non-well-formed. They didn't use tidy -asxml when they imported it or
wrap it in <![CDATA[ raw ]]> blocks.

For most of the files, I can just strip off the 3-line XML preamble and
the trailing 3 lines of closing tags, and I am left with good HTML that
I can process. But some of the shorter ones have all the content on one
line between the innermost tags:

Normal file:
<root>
<inner1>
<content>
<html>
<body>
....
</body>
</html>
</content>
</inner1>
</root>

Shorter file:
<root>
<inner1>
<content>A bunch of text and <strong>fragmented</strong> HTML stuff
with possible embedded
newlines here<br></content>
</inner1>
</root>
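
For the normal files I strip the wrapper with something along these
lines (the tag names are just the ones from the example above, and it
assumes each wrapper tag sits alone on its own line; file names are
placeholders):

egrep -v '^</?(root|inner1|content)> *$' normal.xml > normal.html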

That egrep -v approach works great in the former case. What can help me
in the general case: pulling everything between the <content> and
</content> tags, unchanged, regardless of how the lines break? I don't
think sed will work, and awk looks tricky. Should I write a Perl script
for this? Just ignore newlines and start diverting output at <content>
until I see a </content>? Is there a better way?
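
Something like this perl one-liner is what I had in mind: slurp the
whole file and let the regex match across newlines (untested, and it
assumes exactly one <content> element per file; file names are
placeholders again):

perl -0777 -ne 'print $1 if /<content>(.*?)<\/content>/s' short.xml > short.html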

Robert, have you ever come across this in Perl BioInformatics? Aren't
sequences sometimes in XML?

Ed

-- 
Ed Howland
WDT Solutions, LLC.
[EMAIL PROTECTED]
(314) 962-0766

