[
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412012#comment-17412012
]
Tim Allison edited comment on TIKA-3544 at 9/8/21, 3:40 PM:
------------------------------------------------------------
In TIKA-2025 (which is nearly exactly this issue), we added a custom
TikaExcelDataFormatter that allowed us to inject TikaExcelGeneralFormat. This
departed from Excel's and POI's default handling so that up to 15 digits could
be extracted.
When I look at the underlying XML of the attached file, 6480195344642780 is, in
fact, stored there. If we bump our custom handling to 16 digits, this problem
would be solved _for this file_ and for numbers with 16 digits.
As Tilman and Nick note, though, Excel handles identifier-like numbers poorly,
e.g. credit card numbers or anything that might start with leading zeros. You
have to be really careful to enter them as strings or, better yet, use an
actual database.
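To illustrate the 16-digit point, here is a rough sketch in plain Java. It
assumes the cell value reaches us as a double, which is how numeric cells are
normally exposed. The class and method names (LongDigitDemo, plainDigits,
alwaysExact) are made up for this example and are not part of Tika or POI. It
shows that the stored value can be printed without scientific notation, and
that not every 16-digit number is safe, because a double only represents every
whole number exactly up to 2^53 (9007199254740992).

```java
import java.math.BigDecimal;

// Hypothetical demo class; not Tika or POI code.
final class LongDigitDemo {
    // Every whole number with magnitude up to 2^53 is exactly representable as a double.
    static final long MAX_EXACT = 1L << 53;

    // Renders a whole-number double as plain digits, never in scientific notation.
    // Intended for finite, integer-valued input; NaN and infinities would throw.
    static String plainDigits(double value) {
        return new BigDecimal(value).toPlainString();
    }

    // Conservative check: true if the value is guaranteed to survive storage as a double.
    // Some larger values also survive, but that is not guaranteed.
    static boolean alwaysExact(long digits) {
        return digits >= -MAX_EXACT && digits <= MAX_EXACT;
    }

    public static void main(String[] args) {
        double stored = 6480195344642780d; // the value found in the sheet XML
        System.out.println(Double.toString(stored)); // default rendering of the double
        System.out.println(plainDigits(stored));     // plain-digit rendering
        System.out.println(alwaysExact(6480195344642780L));
        // 2^53 + 1 cannot be stored; it rounds to 2^53 when converted to a double.
        System.out.println((double) 9007199254740993L == 9007199254740992d);
    }
}
```

So a 16-digit limit would cover this file, but 16-digit values above 2^53 may
already have been rounded by the time they are stored as a number.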
> Extraction of long sequences of digits from Excel spreadsheets using Tika
> 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-3544
> URL: https://issues.apache.org/jira/browse/TIKA-3544
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.20
> Reporter: Jitin Jindal
> Priority: Major
> Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.20 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the
> attached spreadsheet as 6.480195344642784E15, which clearly is not the
> desired output.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers to name a few.