[ 
https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355641#comment-15355641
 ] 

Tim Allison edited comment on TIKA-2025 at 6/29/16 6:55 PM:
------------------------------------------------------------

bq. If an Excel spreadsheet contains a long sequence of digits, such as a 
credit card number,

Not quite.  As Javen O'Neal pointed out, Excel can't store numbers with more 
than 15 significant digits as numeric values.  So you'd never be able to store 
a 16-digit credit card number as a number; it would be stored as text, and then 
this wouldn't be a problem.
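
To make the 15-digit limit concrete, here is a rough sketch in plain Java. It 
is only a model: it assumes that digits past the 15th significant digit are 
replaced with zeros, and the class and method names (FifteenDigitDemo, 
asExcelWouldKeep) are made up for illustration. It does not call POI or Excel.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class FifteenDigitDemo {

    // Hypothetical helper: keeps at most 15 significant digits and zeroes
    // the rest. This is an assumed model of the limit, not POI or Excel code.
    static String asExcelWouldKeep(String digits) {
        BigDecimal exact = new BigDecimal(digits);
        BigDecimal kept = exact.round(new MathContext(15, RoundingMode.DOWN));
        return kept.toPlainString();
    }

    public static void main(String[] args) {
        // 15 digits: fits within the limit, so nothing is lost.
        System.out.println(asExcelWouldKeep("340229177292566"));
        // 16 digits: the last digit cannot be kept under this model.
        System.out.println(asExcelWouldKeep("4111111111111111"));
    }
}
```

Under that assumption, a 16-digit value entered as a number would lose its 
last digit, which is why such values end up stored as text instead.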

The change in behavior still applies to numbers with fewer than 16 digits, and 
we need to find a workaround for those.
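
One possible direction, sketched with the JDK only: once the numeric cell 
value is available as a double, it can be rendered as plain digits instead of 
scientific notation. The names below (PlainDigits, toPlainDigits) are 
hypothetical, and this is not how Tika or POI formats cells; it only shows 
that the full digit sequence is still recoverable for values of up to 15 
digits.

```java
import java.math.BigDecimal;

public class PlainDigits {

    // Hypothetical helper: renders a double without an exponent.
    // BigDecimal.valueOf uses the shortest decimal form of the double,
    // and stripTrailingZeros drops a trailing ".0" on whole numbers.
    static String toPlainDigits(double value) {
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        // The 15-digit value from the issue report, held as a double.
        double cellValue = 340229177292566d;
        System.out.println(toPlainDigits(cellValue)); // 340229177292566
        System.out.println(toPlainDigits(1.5));       // 1.5
    }
}
```

Whether something like this belongs in Tika or should be handled through the 
cell's number format is a separate question.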


> Extraction of long sequences of digits from Excel spreadsheets using Tika 
> 1.13 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2025
>                 URL: https://issues.apache.org/jira/browse/TIKA-2025
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Assignee: Tim Allison
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit 
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “340229177292566” is extracted from the 
> attached spreadsheet as 3.40229E+14, which clearly is not the desired output. 
> This works as expected in 1.12 and earlier. I suspect POI’s recent use of 
> org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
> I think the impact of this issue is significant. There’s plenty of 
> information that can no longer be reliably extracted from spreadsheets. Think 
> credit card numbers, telephone numbers and product identifiers to name a few.



