I need some help from the text-processing gurus/script experts here. I have some XML that comes from an Oracle DB (relevant to mention that?) that contains raw HTML (pre-XHTML) content. This makes the resulting XML non-well-formed. They didn't run it through tidy -asxml when they imported it, or wrap it in <![CDATA[ ... ]]> sections.
For most of the files, I can just strip off the 3-line XML preamble and the trailing 3 lines of closing XML tags, and I am left with good HTML that I can process. But some of the shorter ones have all the content on one line between the innermost tags.

Normal file:

  <root>
  <inner1>
  <content>
  <html>
  <body>
  ....
  </body>
  </html>
  </content>
  </inner1>
  </root>

Shorter file:

  <root>
  <inner1>
  <content>A bunch of text and <strong>fragmented</strong> HTML stuff with possible embedded newlines here<br></content>
  </inner1>
  </root>

egrep -v works great in the former case. What can help me in the general case to pull everything between the <content> and </content> tags, unchanged, regardless of line breaks? I don't think sed will work, and awk looks tricky. Should I write a Perl script for this? Just ignore newlines and start diverting once I see <content> until I hit </content>? Is there a better way? (Rough sketch of what I mean below my sig.)

Robert, have you ever come across this in your Perl bioinformatics work? Aren't sequences sometimes in XML?

Ed

--
Ed Howland
WDT Solutions, LLC.
[EMAIL PROTECTED]
(314) 962-0766
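P.S. Something like this is what I had in mind. Completely untested, and it assumes the tags are spelled exactly <content> and </content>, that there is only one such pair per file, and that they never nest:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Slurp the whole file (first argument, or stdin), newlines and all.
  my $xml = do { local $/; <> };

  # /s lets . match newlines, and .*? keeps the match non-greedy so we
  # stop at the first </content> rather than the last one in the file.
  if ($xml =~ m{<content>(.*?)</content>}s) {
      print $1;
  } else {
      warn "no <content>...</content> pair found\n";
  }

Run as something like: perl extract_content.pl input.xml > output.html (the name is just whatever you save it as).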
