Solrians,
We have a request to stop including phonetic strings from xlsx files by default in Tika.
I'm not familiar enough with Japanese to know if users would generally expect
to be able to search on these as well as the original. The current practice is
to include them.
Any recommendations? Thank you!
Best,
Tim
-----Original Message-----
From: Takahiro Ochi (JIRA) [mailto:[email protected]]
Sent: Tuesday, August 8, 2017 2:28 AM
To: [email protected]
Subject: [jira] [Created] (TIKA-2440) Phonetic strings handling for
multilingual environments.
Takahiro Ochi created TIKA-2440:
-----------------------------------
Summary: Phonetic strings handling for multilingual environments.
Key: TIKA-2440
URL: https://issues.apache.org/jira/browse/TIKA-2440
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Takahiro Ochi
Priority: Minor
Hi there,
I would like to propose an idea to improve phonetic string handling for
multilingual environments. I believe Tika should not concatenate phonetic
strings, because text with phonetic strings attached is treated as noisy text
in most natural language processing tasks.
Excel files can include phonetic strings for some languages, such as Japanese
and Chinese. Apache POI concatenates these phonetic strings onto the shared
strings when Tika extracts text from Excel files.
Recent versions of Apache POI have a flag to switch phonetic string
concatenation on or off:
https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)
Tika should set the 2nd argument, "includePhoneticRuns", to false. Here is a
simple patch for my idea.
{code:java}
$ diff -ru XSSFExcelExtractorDecorator.java ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
--- XSSFExcelExtractorDecorator.java 2017-06-10 19:13:33.355412625 +0900
+++ ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java 2017-06-10 19:14:30.452411830 +0900
@@ -130,7 +130,7 @@
             styles = xssfReader.getStylesTable();
             iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
-            strings = new ReadOnlySharedStringsTable(container);
+            strings = new ReadOnlySharedStringsTable(container,false);
         } catch (InvalidFormatException e) {
             throw new XmlException(e);
         } catch (OpenXML4JException oe) {
{code}
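To make the effect of the flag concrete, here is a minimal sketch that uses only the JDK's XML parser, not POI itself. It reads one shared-string entry and either appends or skips the phonetic run. The class name `PhoneticRunsSketch`, the method `extractText`, and the sample entry are hypothetical and exist only for illustration. Joining the base text and the reading with a single space is an assumption of this sketch; POI's actual output format may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PhoneticRunsSketch {

    // Hypothetical <si> entry as it might appear in xl/sharedStrings.xml.
    // The base text is "Tokyo" in kanji; the <rPh> element holds its katakana reading.
    // Unicode escapes keep the source file independent of the compiler's encoding.
    static final String SAMPLE_SI =
        "<si xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
        + "<t>\u6771\u4EAC</t>"
        + "<rPh sb=\"0\" eb=\"2\"><t>\u30C8\u30A6\u30AD\u30E7\u30A6</t></rPh>"
        + "<phoneticPr fontId=\"1\"/>"
        + "</si>";

    // Returns the text of one <si> entry, with or without its phonetic runs.
    static String extractText(String siXml, boolean includePhoneticRuns) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so getLocalName() returns "t", "rPh", ...
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(siXml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), includePhoneticRuns, out);
        return out.toString();
    }

    // Walks the child elements and appends the content of <t> elements.
    private static void collect(Element parent, boolean includePhoneticRuns, StringBuilder out) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String name = child.getLocalName();
            if ("t".equals(name)) {
                out.append(child.getTextContent());
            } else if ("r".equals(name)) {
                // Rich-text run: its <t> child carries ordinary text.
                collect((Element) child, includePhoneticRuns, out);
            } else if ("rPh".equals(name) && includePhoneticRuns) {
                // Phonetic run: appended only on request.
                // The space separator is an assumption of this sketch.
                if (out.length() > 0) {
                    out.append(' ');
                }
                collect((Element) child, true, out);
            }
            // Other elements, such as <phoneticPr>, carry no text and are skipped.
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(SAMPLE_SI, true));
        System.out.println(extractText(SAMPLE_SI, false));
    }
}
```

With the flag set to true, the kanji and its katakana reading end up in one extracted string; with false, only the kanji is returned, which is the behavior the patch above asks for.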