[ https://issues.apache.org/jira/browse/TIKA-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145595#comment-16145595 ]
Hudson commented on TIKA-2440:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1350 (See
[https://builds.apache.org/job/Tika-trunk/1350/])
TIKA-2440 -- extract phonetic runs from xls and allow users to turn off
(tallison:
[https://github.com/apache/tika/commit/74574e38cb2f53395c185742928052d17b181b5a])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-exclude-phonetic.xml
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_phonetic.xls
* (add) tika-parsers/src/test/resources/test-documents/testEXCEL_phonetic.xlsx
* (edit) CHANGES.txt
> Phonetic strings handling for multilingual environments.
> --------------------------------------------------------
>
> Key: TIKA-2440
> URL: https://issues.apache.org/jira/browse/TIKA-2440
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Takahiro Ochi
> Priority: Minor
>
> Hi there,
> I would like to propose an idea to improve phonetic string handling for
> multilingual environments. I believe Tika should not concatenate phonetic
> strings, because text with phonetic strings appended is treated as noisy
> text in most natural language processing situations.
> Excel files include phonetic strings in some languages such as Japanese,
> Chinese and so on. Apache POI concatenates phonetic strings onto the shared
> strings when Tika extracts text from Excel files.
> Recent versions of Apache POI have a switch flag for phonetic string
> concatenation, as follows:
> https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)
> Tika should set the 2nd argument "includePhoneticRuns" to false. Here is a
> simple patch for my idea.
> {code:java}
> $ diff -ru XSSFExcelExtractorDecorator.java ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
> --- XSSFExcelExtractorDecorator.java 2017-06-10 19:13:33.355412625 +0900
> +++ ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java 2017-06-10 19:14:30.452411830 +0900
> @@ -130,7 +130,7 @@
> styles = xssfReader.getStylesTable();
> iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
> - strings = new ReadOnlySharedStringsTable(container);
> + strings = new ReadOnlySharedStringsTable(container,false);
> } catch (InvalidFormatException e) {
> throw new XmlException(e);
> } catch (OpenXML4JException oe) {
> {code}
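
To illustrate what the "includePhoneticRuns" flag discussed above changes, here is a minimal, self-contained sketch. It does not use Apache POI or Tika: the class PhoneticRunsSketch, the SharedString type, and the extract method are hypothetical names invented for this illustration, and the single-space separator between the base text and each phonetic run is an assumption of the sketch, not a statement about POI's behaviour.

```java
import java.util.Arrays;
import java.util.List;

public class PhoneticRunsSketch {

    /** Hypothetical model of one shared string: base text plus its phonetic runs. */
    static final class SharedString {
        final String baseText;
        final List<String> phoneticRuns;

        SharedString(String baseText, List<String> phoneticRuns) {
            this.baseText = baseText;
            this.phoneticRuns = phoneticRuns;
        }
    }

    /**
     * Returns the text for a shared string. When includePhoneticRuns is false,
     * only the base text is returned; when true, each phonetic run is appended
     * after a single space (the separator is an assumption of this sketch).
     */
    static String extract(SharedString s, boolean includePhoneticRuns) {
        if (!includePhoneticRuns || s.phoneticRuns.isEmpty()) {
            return s.baseText;
        }
        StringBuilder sb = new StringBuilder(s.baseText);
        for (String run : s.phoneticRuns) {
            sb.append(' ').append(run);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Kanji base text with a katakana reading, written as Unicode escapes
        // so the source compiles regardless of the platform file encoding.
        SharedString cell = new SharedString(
                "\u6771\u4EAC\u90FD",
                Arrays.asList("\u30C8\u30A6\u30AD\u30E7\u30A6\u30C8"));

        System.out.println(extract(cell, true));   // base text followed by the reading
        System.out.println(extract(cell, false));  // base text only
    }
}
```

With the flag set to false, downstream text processing sees only the text the author typed into the cell, which is the behaviour the proposed patch asks for.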