Hello, how are you people?

I need to extract text from Word, PPT, PPS, and XLS documents. This is working
fine, but when POI finds an embedded image, graphic, or other object, an
EMBED "tag" is appended to the string. So far this is happening with
"WordExtractor" and "HWPFDocument".

My problem: I need to create an XML file with a summary of the text to show
in the result page (software structure), and the XML parser can't validate
these tags because of the strange characters. Is there a way to exclude
them from the text extraction?

Ex.:
TextExtraction:
           // Collects the text of every paragraph in the document.
           StringBuilder wordDocText = new StringBuilder();
           POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream);
           HWPFDocument document = new HWPFDocument(fileSystem);
           Range range = document.getRange();
           for (int i = 0; i < range.numParagraphs(); i++)
           {
               Paragraph paragraph = range.getParagraph(i);
               // paragraph.text() still contains the EMBED field text.
               wordDocText.append(paragraph.text());
           }
           System.out.println(wordDocText.toString());
Result (the strange characters don't show in the email body...):
  -->     EMBED Word.Picture.8  
           Documento de Projeto
           Manual do Usuário
           Web Publication
  -->     EMBED CorelDraw.Graphic.9  


StackTrace from the Parser:
Caused by: net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException:
An invalid XML character (Unicode: 0x14) was found in the CDATA section.
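
As a stopgap, I could filter the string myself before writing the XML. The
sketch below is only an idea, not something POI provides: the class name
XmlSafeText and its methods are hypothetical, and it assumes that Word marks
a field with 0x13 (begin), 0x14 (separator between field code and result) and
0x15 (end), so the "EMBED ..." text sits between 0x13 and 0x14. It drops that
field code and then any character that XML 1.0 does not allow. Nested fields
are not handled.

```java
// Hypothetical helper, not part of POI: removes Word field codes and
// characters that are not legal in XML 1.0 from already-extracted text.
class XmlSafeText {

    // Assumed Word field markers as they show up in the extracted string.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    // True if the code point matches the Char production of XML 1.0.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text without field codes and invalid XML characters.
    static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inFieldCode = false; // true between 0x13 and 0x14 (or 0x15)
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == FIELD_BEGIN) {
                inFieldCode = true;   // start skipping the field code
                continue;
            }
            if (cp == FIELD_SEPARATOR || cp == FIELD_END) {
                inFieldCode = false;  // field code is over; drop the marker
                continue;
            }
            if (inFieldCode || !isValidXmlChar(cp)) {
                continue;             // skip "EMBED ..." and control chars
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }
}
```

In my loop this would mean appending XmlSafeText.clean(paragraph.text())
instead of paragraph.text(). I would still prefer a way to make POI leave
these fields out in the first place.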


Thanks people! Any help is useful,

--
Fernando Bernardino
