[ 
https://issues.apache.org/jira/browse/TIKA-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145477#comment-16145477
 ] 

Tim Allison commented on TIKA-2440:
-----------------------------------

Got it.  Thank you!

# I made it configurable to turn off phonetic runs in xlsx.  
# I added extraction of phonetic runs in xls.  (We weren't doing it before, so 
this is new behavior, but I want the behavior to be the same between xls and 
xlsx.)
# I used the same configuration mechanism to allow users to turn off extraction 
of phonetic runs in xls.

See the new unit tests or this tika-config.xml for how to turn off phonetic 
runs: 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-exclude-phonetic.xml

Still on the todo list is to deal with phonetic runs in XLSB...I think?

> Phonetic strings handling for multilingual environments.
> --------------------------------------------------------
>
>                 Key: TIKA-2440
>                 URL: https://issues.apache.org/jira/browse/TIKA-2440
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Takahiro Ochi
>            Priority: Minor
>
> Hi there,
> I would like to propose an idea to improve phonetic strings handling for 
> multilingual environments. I believe Tika should not concatenate phonetic 
> strings because text with phonetic strings is recognized as noisy text in 
> most situations of natural language processing.
> Excel files include phonetic strings in some languages such as Japanese, 
> Chinese and so on. Apache POI concatenates phonetic strings onto the shared 
> strings when Tika extracts text from Excel files.
> Recent versions of Apache POI have a flag that switches phonetic string 
> concatenation on or off, as follows:
> https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)
> Tika should set the 2nd argument "includePhoneticRuns" to false. Here is a 
> simple patch for my idea.
> {code:java}
> $ diff -ru XSSFExcelExtractorDecorator.java ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
> --- XSSFExcelExtractorDecorator.java    2017-06-10 19:13:33.355412625 +0900
> +++ ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java    2017-06-10 19:14:30.452411830 +0900
> @@ -130,7 +130,7 @@
>              styles = xssfReader.getStylesTable();
>              iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
> -            strings = new ReadOnlySharedStringsTable(container);
> +            strings = new ReadOnlySharedStringsTable(container,false);
>          } catch (InvalidFormatException e) {
>              throw new XmlException(e);
>          } catch (OpenXML4JException oe) {
> {code}
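
To illustrate what the "includePhoneticRuns" flag in the patch above controls, here is a minimal sketch using only the JDK's XML parser. It is a toy model, not POI's or Tika's implementation: the class name PhoneticRunDemo and the helper sharedStringText are hypothetical, and the sample shared-string item is hand-written (base text in <t>, phonetic reading in <rPh>).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Toy model only (hypothetical names): shows the effect of including or
// skipping phonetic runs when flattening one shared-string item to text.
public class PhoneticRunDemo {

    // Returns the text of a single <si> element. Text inside <rPh>
    // (phonetic run) elements is appended only when includePhoneticRuns is true.
    public static String sharedStringText(String siXml, boolean includePhoneticRuns) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(siXml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), includePhoneticRuns, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse shared string item", e);
        }
    }

    // Walks child elements in document order, appending the content of <t>
    // elements and optionally skipping whole <rPh> subtrees.
    private static void collect(Node node, boolean includePhoneticRuns, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String name = child.getNodeName();
            if ("rPh".equals(name) && !includePhoneticRuns) {
                continue; // drop the phonetic reading
            }
            if ("t".equals(name)) {
                out.append(child.getTextContent());
            } else {
                collect(child, includePhoneticRuns, out);
            }
        }
    }

    public static void main(String[] args) {
        // Hand-written sample: base text U+6771 U+4EAC ("Tokyo" in kanji)
        // followed by its katakana reading as a phonetic run.
        String si = "<si><t>\u6771\u4eac</t>"
                + "<rPh sb=\"0\" eb=\"2\"><t>\u30c8\u30a6\u30ad\u30e7\u30a6</t></rPh>"
                + "<phoneticPr fontId=\"1\"/></si>";
        System.out.println(sharedStringText(si, true));  // base text plus reading
        System.out.println(sharedStringText(si, false)); // base text only
    }
}
```

With the flag set to true, the reading is glued onto the cell text, which is the noise the reporter describes; with false, only the base text survives.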



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
