[
https://issues.apache.org/jira/browse/TIKA-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brahmaiah Doma updated TIKA-3171:
---------------------------------
Description:
Could not read a Word doc that has superscripts and subscripts as HTML chunks.
I have a document with the text "What is 3^2^ + 2^3^ ?", but when parsing it
with Tika I get the text back as "32 + 23?".
Code sample I used:
ContentHandler handler = new BodyContentHandler();
// this is Tika's parser, not one of our question parsers
AutoDetectParser tikaParser = new AutoDetectParser();
Metadata metadata = new Metadata();
try {
    InputStream stream = FileUtils.openInputStream(new File(".docX"));
    // The embedded extractor needs the file's InputStream wrapped in a TikaInputStream
    TikaInputStream tikaInputStream = TikaInputStream.get(stream);
    new OOXMLParser().parse(tikaInputStream, handler, metadata, new ParseContext());
    String content = handler.toString();
    System.out.println(content);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}
Let me know if there is any other way I could get the same text back from Word
after parsing it through Tika.
was:
Could not read a Word doc that has superscripts and subscripts as HTML chunks.
I have a document with the text "What is 3^2^ + 2^3^ ?", but when parsing it
with Tika I get the text back as "32 + 23?".
Code sample I used:
ContentHandler handler = new BodyContentHandler();
// this is Tika's parser, not one of our question parsers
AutoDetectParser tikaParser = new AutoDetectParser();
Metadata metadata = new Metadata();
try {
    InputStream stream = FileUtils.openInputStream(
        new File("/Users/bdoma/Downloads/questionimportwithexponents_NP.docx"));
    // The embedded extractor needs the file's InputStream wrapped in a TikaInputStream
    TikaInputStream tikaInputStream = TikaInputStream.get(stream);
    new OOXMLParser().parse(tikaInputStream, handler, metadata, new ParseContext());
    String content = handler.toString();
    System.out.println(content);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}
Let me know if there is any other way I could get the same text back from Word
after parsing it through Tika.
> Could not read word doc which has superScripts and subscripts as htmlChunks?
> ----------------------------------------------------------------------------
>
> Key: TIKA-3171
> URL: https://issues.apache.org/jira/browse/TIKA-3171
> Project: Tika
> Issue Type: Improvement
> Components: core, parser
> Affects Versions: 1.24.1
> Reporter: Brahmaiah Doma
> Priority: Major
>
>
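One possible way to get the markers back, sketched here outside of Tika and only as an assumption about what would help: in WordprocessingML a superscript or subscript run carries a `w:vertAlign` element in its run properties, so the raw document XML can be walked with the JDK's DOM parser and the runs re-wrapped in Jira-style `^...^` and `~...~` markers. The class name `ScriptMarkup`, the helper `run`, and the `SAMPLE` string are hypothetical and exist only for this illustration; paragraph breaks and all other formatting are ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: re-creates ^sup^ / ~sub~ markers
// from WordprocessingML runs using only the JDK.
public class ScriptMarkup {
    // Main WordprocessingML namespace.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample standing in for word/document.xml (an assumption,
    // not the reporter's actual file).
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
        + run("What is 3", null) + run("2", "superscript")
        + run(" + 2", null) + run("3", "superscript") + run(" ?", null)
        + "</w:p></w:body></w:document>";

    // Builds one <w:r> run, optionally with a vertAlign run property.
    static String run(String text, String vertAlign) {
        String props = vertAlign == null
            ? ""
            : "<w:rPr><w:vertAlign w:val=\"" + vertAlign + "\"/></w:rPr>";
        return "<w:r>" + props + "<w:t xml:space=\"preserve\">" + text + "</w:t></w:r>";
    }

    // Concatenates the text of every run, wrapping super/subscript runs in markers.
    static String toMarkedText(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));
            StringBuilder out = new StringBuilder();
            NodeList runs = doc.getElementsByTagNameNS(W, "r");
            for (int i = 0; i < runs.getLength(); i++) {
                Element r = (Element) runs.item(i);
                String marker = markerFor(r);
                out.append(marker).append(textOf(r)).append(marker);
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse WordprocessingML", e);
        }
    }

    // "^" for superscript runs, "~" for subscript runs, "" otherwise.
    private static String markerFor(Element r) {
        NodeList aligns = r.getElementsByTagNameNS(W, "vertAlign");
        if (aligns.getLength() == 0) {
            return "";
        }
        String val = ((Element) aligns.item(0)).getAttributeNS(W, "val");
        if ("superscript".equals(val)) {
            return "^";
        }
        return "subscript".equals(val) ? "~" : "";
    }

    // Joins the text of all <w:t> elements inside one run.
    private static String textOf(Element r) {
        StringBuilder text = new StringBuilder();
        NodeList parts = r.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < parts.getLength(); i++) {
            text.append(parts.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(toMarkedText(SAMPLE));
    }
}
```

For a real file, the XML would come from the `word/document.xml` entry of the .docx archive, which can be read with `java.util.zip.ZipFile`. This bypasses Tika's text extraction entirely, so it is a workaround sketch rather than a fix for the behavior reported above.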
--
This message was sent by Atlassian Jira
(v8.3.4#803005)