[jira] [Updated] (TIKA-3171) Could not read word doc which has superScripts and subscripts as htmlChunks?

Brahmaiah Doma (Jira) Mon, 17 Aug 2020 12:00:42 -0700


     [ https://issues.apache.org/jira/browse/TIKA-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Brahmaiah Doma updated TIKA-3171:
---------------------------------
    Description: 
Could not read word doc which has superScripts and subscripts as htmlChunks?

I have a Word document containing the text "What is 3^2^ + 2^3^ ?" (written 
here in Jira markup, where ^...^ marks a superscript), but when I parse it with 
Tika I get the text back as "32 + 23?". 

 

Code sample I used:

 

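// Imports needed by the snippet
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.FileUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

ContentHandler handler = new BodyContentHandler();
// This is Tika's parser, not one of our question parsers
// (declared here but not used below; OOXMLParser is called directly)
AutoDetectParser tikaParser = new AutoDetectParser();
Metadata metadata = new Metadata();

try {
    InputStream stream = FileUtils.openInputStream(new File(".docX"));
    // The embedded extractor needs the file's InputStream wrapped in a TikaInputStream
    TikaInputStream tikaInputStream = TikaInputStream.get(stream);
    new OOXMLParser().parse(tikaInputStream, handler, metadata, new ParseContext());
    String content = handler.toString();
    System.out.println(content);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}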

 

Please let me know whether there is any other way to get the same text back 
from the Word document, with superscripts and subscripts preserved, after 
parsing it with Tika.
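
To illustrate the symptom: in a .docx file, superscript and subscript are stored as a formatting property on each text run, so any extraction that concatenates only the run text turns "3" followed by a superscript "2" into "32". The sketch below is not Tika or Apache POI code. `Run`, `VAlign`, `plainText`, `render`, and `sample` are hypothetical names, and the runs are hard-coded rather than read from a file. It only shows how the ^...^ (superscript) and ~...~ (subscript) markup used in this report could be rebuilt, assuming the per-run formatting is available from whatever library reads the document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of formatted text runs; not part of Tika or POI.
class SuperscriptMarkup {

    enum VAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    // One run of text together with its vertical alignment.
    static final class Run {
        final String text;
        final VAlign align;

        Run(String text, VAlign align) {
            this.text = text;
            this.align = align;
        }
    }

    // Concatenates run text only, discarding the formatting.
    static String plainText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    // Wraps superscript runs in ^...^ and subscript runs in ~...~.
    static String render(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            switch (r.align) {
                case SUPERSCRIPT:
                    sb.append('^').append(r.text).append('^');
                    break;
                case SUBSCRIPT:
                    sb.append('~').append(r.text).append('~');
                    break;
                default:
                    sb.append(r.text);
            }
        }
        return sb.toString();
    }

    // Hard-coded runs standing in for the sentence from the report.
    static List<Run> sample() {
        return Arrays.asList(
            new Run("What is 3", VAlign.BASELINE),
            new Run("2", VAlign.SUPERSCRIPT),
            new Run(" + 2", VAlign.BASELINE),
            new Run("3", VAlign.SUPERSCRIPT),
            new Run(" ?", VAlign.BASELINE));
    }

    public static void main(String[] args) {
        System.out.println(plainText(sample())); // formatting lost
        System.out.println(render(sample()));    // formatting kept as markup
    }
}
```

Running `main` prints the flattened text first and the marked-up text second, so the two can be compared side by side.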

 

 

  was:
Could not read word doc which has superScripts and subscripts as htmlChunks?

I have a Word document containing the text "What is 3^2^ + 2^3^ ?", but when I 
parse it with Tika I get the text back as "32 + 23?". 

 

Code sample I used:

 

ContentHandler handler = new BodyContentHandler();
// This is Tika's parser, not one of our question parsers
AutoDetectParser tikaParser = new AutoDetectParser();
Metadata metadata = new Metadata();

try {
    InputStream stream = FileUtils.openInputStream(
            new File("/Users/bdoma/Downloads/questionimportwithexponents_NP.docx"));
    // The embedded extractor needs the file's InputStream wrapped in a TikaInputStream
    TikaInputStream tikaInputStream = TikaInputStream.get(stream);
    new OOXMLParser().parse(tikaInputStream, handler, metadata, new ParseContext());
    String content = handler.toString();
    System.out.println(content);
} catch (IOException | SAXException | TikaException e) {
    e.printStackTrace();
}

 

Please let me know whether there is any other way to get the same text back 
from the Word document after parsing it with Tika.

 

 


> Could not read word doc which has superScripts and subscripts as htmlChunks?
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-3171
>                 URL: https://issues.apache.org/jira/browse/TIKA-3171
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, parser
>    Affects Versions: 1.24.1
>            Reporter: Brahmaiah Doma
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
