Brahmaiah Doma created TIKA-3171:
------------------------------------
Summary: Could not read word doc which has superScripts and
subscripts as htmlChunks?
Key: TIKA-3171
URL: https://issues.apache.org/jira/browse/TIKA-3171
Project: Tika
Issue Type: Improvement
Components: core, parser
Reporter: Brahmaiah Doma
Could not read a Word document that has superscripts and subscripts as htmlChunks.
I have a document containing the text "What is 3^2^ + 2^3^ ?" (the exponents are
superscripts), but when I parse it with Tika the text comes back as "32 + 23?".
Code sample I used:
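import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.FileUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

ContentHandler handler = new BodyContentHandler();
// this is Tika's parser, not one of our question parsers (created but not used below)
AutoDetectParser tikaParser = new AutoDetectParser();
Metadata metadata = new Metadata();
// The embedded extractor needs the file's InputStream wrapped in a TikaInputStream
try (InputStream stream = FileUtils.openInputStream(
         new File("/Users/bdoma/Downloads/questionimportwithexponents_NP.docx"));
     TikaInputStream tikaInputStream = TikaInputStream.get(stream)) {
    new OOXMLParser().parse(tikaInputStream, handler, metadata, new ParseContext());
    String content = handler.toString();
    System.out.println(content);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}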
Please let me know whether there is any other way to get the same text back from
a Word document after parsing it with Tika.
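
For reference, a possible workaround outside Tika is sketched below. It relies
on the fact that a .docx stores superscript and subscript formatting as a
w:vertAlign element inside a run's properties in word/document.xml, which is
the information that is lost when only the plain text is collected. The sketch
uses only the JDK XML parser and works on the XML as a string; the class name
VertAlignSketch, the render helper, the ~x~ notation for subscripts and the
sample XML are all illustrative assumptions, not part of Tika or of the real
document. Reading word/document.xml out of the .docx (for example with
java.util.zip.ZipFile) is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class VertAlignSketch {
    // WordprocessingML main namespace used by word/document.xml
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates run text, marking superscript runs as ^x^ and subscript runs as ~x~. */
    static String render(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(W, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            // Collect the text of every w:t element in this run
            StringBuilder text = new StringBuilder();
            NodeList texts = run.getElementsByTagNameNS(W, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            // w:vertAlign/@w:val is "superscript", "subscript" or "baseline"
            NodeList align = run.getElementsByTagNameNS(W, "vertAlign");
            String val = align.getLength() == 0
                    ? ""
                    : ((Element) align.item(0)).getAttributeNS(W, "val");
            String mark = "superscript".equals(val) ? "^"
                    : "subscript".equals(val) ? "~" : "";
            out.append(mark).append(text).append(mark);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample standing in for the real word/document.xml
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
                + "<w:r><w:t xml:space=\"preserve\">What is 3</w:t></w:r>"
                + "<w:r><w:rPr><w:vertAlign w:val=\"superscript\"/></w:rPr><w:t>2</w:t></w:r>"
                + "<w:r><w:t xml:space=\"preserve\"> + 2</w:t></w:r>"
                + "<w:r><w:rPr><w:vertAlign w:val=\"superscript\"/></w:rPr><w:t>3</w:t></w:r>"
                + "<w:r><w:t xml:space=\"preserve\"> ?</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(render(xml)); // prints: What is 3^2^ + 2^3^ ?
    }
}
```

The sketch ignores paragraph breaks, tables and text boxes, so it would only
be a starting point for a real document.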
--
This message was sent by Atlassian Jira
(v8.3.4#803005)