Aeham Abushwashi created TIKA-2025:
--------------------------------------
Summary: Extraction of long sequences of digits from Excel
spreadsheets using Tika 1.13 doesn’t yield the expected results
Key: TIKA-2025
URL: https://issues.apache.org/jira/browse/TIKA-2025
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Reporter: Aeham Abushwashi

If an Excel spreadsheet contains a long sequence of digits, such as a credit
card number, Tika 1.13 emits that sequence in scientific notation.
For example, the credit card number “340229177292566” is extracted from the
attached spreadsheet as 3.40229E+14, which drops the last nine digits and is
clearly not the desired output.

This works as expected in 1.12 and earlier. I suspect POI’s recent use of
org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.

I think the impact of this issue is significant. A lot of information can no
longer be reliably extracted from spreadsheets: credit card numbers, telephone
numbers and product identifiers, to name a few.
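
To make the data loss concrete, here is a minimal sketch using only the JDK. It does not call Tika or POI, and it is not how ExcelGeneralNumberFormat is implemented; the class name, the helper names and the "0.#####E0" pattern are assumptions chosen to mimic the six significant digits seen in the reported output. Note that java.text.DecimalFormat writes the exponent without a plus sign.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Illustrative sketch only: LongDigitSequenceSketch and its helpers are hypothetical names.
class LongDigitSequenceSketch {

    // Formats a numeric cell value in scientific notation with at most six
    // significant digits, similar in effect to the output described in the report.
    static String sixDigitScientific(double cellValue) {
        DecimalFormat fmt = new DecimalFormat(
                "0.#####E0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return fmt.format(cellValue);
    }

    // Renders a whole-number cell value as plain digits, with no exponent.
    // Intended for integral values only; fractional doubles would print
    // their full binary expansion.
    static String plainDigits(double cellValue) {
        return new BigDecimal(cellValue).toPlainString();
    }

    public static void main(String[] args) {
        double cellValue = 340229177292566d; // the example number from the report

        String lossy = sixDigitScientific(cellValue);
        System.out.println("scientific: " + lossy);

        // Expanding the scientific form back to plain digits shows what was lost.
        System.out.println("expanded:   " + new BigDecimal(lossy).toPlainString());

        // The double itself still holds every digit of the original number.
        System.out.println("plain:      " + plainDigits(cellValue));
    }
}
```

The value stored in the cell is intact (a 15-digit integer fits exactly in a double); only the formatting step discards digits, which is why the regression points at the number format rather than at the parsing of the cell.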