Brahmaiah Doma created TIKA-3171:
------------------------------------

             Summary: Could not read word doc which has superScripts and 
subscripts as htmlChunks?
                 Key: TIKA-3171
                 URL: https://issues.apache.org/jira/browse/TIKA-3171
             Project: Tika
          Issue Type: Improvement
          Components: core, parser
            Reporter: Brahmaiah Doma


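Tika does not preserve superscripts and subscripts when reading a Word document. Can they be returned as HTML chunks?

I have a document containing the text "What is 3^2^ + 2^3^ ?" (in Jira markup, 
^...^ marks a superscript), but when I parse it with Tika the text comes back 
as "32 + 23?", with the exponents flattened into ordinary digits.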

 

Code sample I used:

 

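import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.FileUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

ContentHandler handler = new BodyContentHandler();
// this is Tika's parser, not one of our question parsers
// (declared but unused; OOXMLParser is invoked directly below)
AutoDetectParser tikaParser = new AutoDetectParser();
Metadata metadata = new Metadata();

// try-with-resources closes both streams when parsing finishes.
// The embedded extractor needs the file's InputStream wrapped in a TikaInputStream.
try (InputStream stream = FileUtils.openInputStream(
         new File("/Users/bdoma/Downloads/questionimportwithexponents_NP.docx"));
     TikaInputStream tikaInputStream = TikaInputStream.get(stream)) {
    new OOXMLParser().parse(tikaInputStream, handler, metadata, new ParseContext());
    String content = handler.toString();
    System.out.println(content);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}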

 

Please let me know whether there is another way to get the same text back from 
a Word document, with superscripts and subscripts still distinguishable, after 
parsing it through Tika.
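
For illustration, here is a minimal sketch of one possible workaround that does 
not go through Tika at all. It relies only on the JDK and on the fact that a 
.docx file is a zip archive whose word/document.xml marks superscript and 
subscript runs with a w:vertAlign element. The class name VertAlignText, its 
method names, and the choice of ^...^ and ~...~ as output markers are 
assumptions made up for this example; they are not part of Tika or of any 
library. The sketch ignores paragraph breaks, tables, and text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of Tika.
class VertAlignText {
    // WordprocessingML namespace used by word/document.xml inside a .docx
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the text of every run (w:r), wrapping superscript runs
    // in ^...^ and subscript runs in ~...~. Paragraph breaks are not emitted.
    static String extract(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);
        StringBuilder out = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(W, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            String mark = marker(run);
            NodeList texts = run.getElementsByTagNameNS(W, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(mark)
                   .append(texts.item(j).getTextContent())
                   .append(mark);
            }
        }
        return out.toString();
    }

    // Returns "^" for superscript runs, "~" for subscript runs, "" otherwise.
    static String marker(Element run) {
        NodeList align = run.getElementsByTagNameNS(W, "vertAlign");
        if (align.getLength() == 0) {
            return "";
        }
        String val = ((Element) align.item(0)).getAttributeNS(W, "val");
        if ("superscript".equals(val)) {
            return "^";
        }
        if ("subscript".equals(val)) {
            return "~";
        }
        return "";
    }

    // Reads word/document.xml straight out of the .docx zip container.
    static String extractFromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path);
             InputStream in = zip.getInputStream(zip.getEntry("word/document.xml"))) {
            return extract(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for the document.xml of the example question.
        String xml =
            "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
          + "<w:r><w:t xml:space=\"preserve\">What is 3</w:t></w:r>"
          + "<w:r><w:rPr><w:vertAlign w:val=\"superscript\"/></w:rPr><w:t>2</w:t></w:r>"
          + "<w:r><w:t xml:space=\"preserve\"> + 2</w:t></w:r>"
          + "<w:r><w:rPr><w:vertAlign w:val=\"superscript\"/></w:rPr><w:t>3</w:t></w:r>"
          + "<w:r><w:t xml:space=\"preserve\"> ?</w:t></w:r>"
          + "</w:p></w:body></w:document>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(in)); // prints: What is 3^2^ + 2^3^ ?
    }
}
```

For a real file, VertAlignText.extractFromDocx("/path/to/file.docx") would 
apply the same logic to the document body. This is only a stopgap for simple 
documents; it does not answer whether Tika itself can expose this formatting.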

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
