[ https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-2025:
---------------------------------

    Assignee: Tim Allison

> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2025
>                 URL: https://issues.apache.org/jira/browse/TIKA-2025
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Assignee: Tim Allison
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “340229177292566” is extracted from the
> attached spreadsheet as 3.40229E+14, which clearly is not the desired output.
> This works as expected in 1.12 and earlier. I suspect POI’s recent use of
> org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers to name a few.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
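
The information loss described in the report can be illustrated with a minimal JDK-only sketch. It does not call Tika or POI and is not their code path; the class name LongDigitDemo and the pattern "0.#####E0" are assumptions chosen only to mimic the reported output (Java's DecimalFormat prints the exponent without the "+" sign that appears in the report). The sketch formats the same numeric cell value two ways: once in scientific notation with six significant digits, and once as a plain digit string.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical demo class; not part of Tika or POI.
final class LongDigitDemo {

    // Scientific notation with at most 5 fraction digits, similar in shape
    // to the output quoted in the report. The pattern is an assumption.
    static String scientific(double value) {
        DecimalFormat format = new DecimalFormat(
                "0.#####E0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(value);
    }

    // Plain decimal rendering with no exponent. A 15-digit integer is below
    // 2^53, so it is represented exactly by a double and no digits are lost.
    static String plain(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // Numeric cells are stored as floating-point values in Excel,
        // so the number from the report is modelled as a double here.
        double cell = 340229177292566.0;

        System.out.println(scientific(cell)); // 3.40229E14
        System.out.println(plain(cell));      // 340229177292566
    }
}
```

The scientific form keeps only the first six digits, so the original number cannot be recovered from the extracted text, which is why the report treats this as a significant regression for identifiers such as card or phone numbers.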