[ 
https://issues.apache.org/jira/browse/TIKA-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145600#comment-16145600
 ] 

Tim Allison commented on TIKA-2440:
-----------------------------------

And another thing on the todo list, [~Takahiro], if you need similar handling 
of phonetic strings in .docx, please open another issue.  It looks like 
17.3.3.24 rt (Phonetic Guide Text) and 17.3.3.25 ruby (Phonetic Guide) in the 
ECMA OOXML spec are relevant.  I couldn't find anything about phonetic strings 
in .doc.  How about ppt or pptx?

Thank you!

> Phonetic strings handling for multilingual environments.
> --------------------------------------------------------
>
>                 Key: TIKA-2440
>                 URL: https://issues.apache.org/jira/browse/TIKA-2440
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Takahiro Ochi
>            Priority: Minor
>             Fix For: 1.17
>
>
> Hi there,
> I would like to propose an idea to improve phonetic strings handling for 
> multilingual environments. I believe Tika should not concatenate phonetic 
> strings because text with phonetic strings is recognized as noisy text in 
> most situations of natural language processing.
> Excel files include phonetic strings in some languages such as Japanese, 
> Chinese and so on. Apache POI concatenates phonetic strings onto the shared 
> strings when Tika extracts text from Excel files.
> Recent Apache POI has a switch flag for phonetic string concatenation as 
> follows:
> https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)
> Tika should set the 2nd argument "includePhoneticRuns" to false. Here is a 
> simple patch for my idea.
> {code:java}
> $ diff -ru XSSFExcelExtractorDecorator.java ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
> --- XSSFExcelExtractorDecorator.java    2017-06-10 19:13:33.355412625 +0900
> +++ ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java 2017-06-10 19:14:30.452411830 +0900
> @@ -130,7 +130,7 @@
>              styles = xssfReader.getStylesTable();
>              iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
> -            strings = new ReadOnlySharedStringsTable(container);
> +            strings = new ReadOnlySharedStringsTable(container,false);
>          } catch (InvalidFormatException e) {
>              throw new XmlException(e);
>          } catch (OpenXML4JException oe) {
> {code}
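
To illustrate what the flag in the patch above controls, here is a minimal sketch that uses only the JDK's SAX parser. It is not POI's code: the class PhoneticRunSketch, the method readSharedStrings, and the SAMPLE data are hypothetical names made up for this illustration. It assumes that a shared string entry (si) stores its base text in t elements and its phonetic reading inside rPh elements, and it skips the text inside rPh when includePhoneticRuns is false. POI's real ReadOnlySharedStringsTable may differ in details, for example in how it separates the base text from the reading.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; this is not POI's class.
final class PhoneticRunSketch {

    // Made-up sample: "Tokyo" in kanji followed by its katakana reading in an
    // rPh element, then a plain entry. Unicode escapes keep the source ASCII.
    static final String SAMPLE =
        "<sst xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">\n"
        + "<si><t>\u6771\u4EAC</t>"
        + "<rPh sb=\"0\" eb=\"2\"><t>\u30C8\u30A6\u30AD\u30E7\u30A6</t></rPh></si>\n"
        + "<si><t>plain</t></si>\n"
        + "</sst>";

    /** Returns the text of each si entry, with or without the rPh phonetic runs. */
    static List<String> readSharedStrings(String xml, boolean includePhoneticRuns)
            throws Exception {
        final List<String> strings = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                private final StringBuilder current = new StringBuilder();
                private boolean inText;
                private boolean inPhoneticRun;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    switch (localName) {
                        case "si":  current.setLength(0); break; // new entry
                        case "rPh": inPhoneticRun = true; break;
                        case "t":   inText = true; break;
                        default:    break;
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    switch (localName) {
                        case "si":  strings.add(current.toString()); break;
                        case "rPh": inPhoneticRun = false; break;
                        case "t":   inText = false; break;
                        default:    break;
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // Keep base text always; keep phonetic text only on request.
                    if (inText && (includePhoneticRuns || !inPhoneticRun)) {
                        current.append(ch, start, length);
                    }
                }
            });
        return strings;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readSharedStrings(SAMPLE, true));  // base text plus reading
        System.out.println(readSharedStrings(SAMPLE, false)); // base text only
    }
}
```

With the flag set to false, the first entry contains only the kanji, which is the behavior the patch asks Tika to request from POI.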



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
