[https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389474#comment-15389474]
Hudson commented on TIKA-2025:
------------------------------
FAILURE: Integrated in tika-2.x #122 (See [https://builds.apache.org/job/tika-2.x/122/])
TIKA-2025 increase number of significant digits extracted in "general"
(tallison: rev f4bacf859650abbe438d7e19d6c0abdcd72a5b34)
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* CHANGES.txt
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelGeneralFormat.java
* tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xlsx
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelDataFormatter.java
* tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xls
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java

> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-2025
> URL: https://issues.apache.org/jira/browse/TIKA-2025
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Aeham Abushwashi
> Assignee: Tim Allison
> Fix For: 2.0, 1.14
>
> Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.13 emits that sequence in scientific notation.
> For example, the credit card number “340229177292566” is extracted from the
> attached spreadsheet as 3.40229E+14, which is clearly not the desired output.
> This works as expected in 1.12 and earlier. I suspect POI’s recent use of
> org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers, to name a few.
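
The symptom described above can be reproduced with the JDK alone. The sketch below is not the Tika or POI code; the class name DigitFormatSketch and its two helpers are hypothetical and only illustrate the effect. It assumes the cell value arrives as a double and shows how a formatter limited to six significant digits discards most of the number, while a plain decimal rendering keeps every digit. Note that java.text.DecimalFormat writes the exponent without a plus sign, so its output differs slightly from the "E+14" form quoted in the report.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class DigitFormatSketch {

    // Hypothetical helper: scientific notation capped at six significant
    // digits, which mimics the kind of output reported in the issue.
    static String scientific(double value) {
        DecimalFormat fmt = new DecimalFormat(
                "0.#####E0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return fmt.format(value);
    }

    // Hypothetical helper: plain decimal rendering with no exponent.
    // Exact only while the value fits in a double without rounding
    // (whole numbers up to 2^53).
    static String plain(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        double cell = 340229177292566d; // the number quoted in the report
        System.out.println(scientific(cell)); // 3.40229E14
        System.out.println(plain(cell));      // 340229177292566
    }
}
```

The commit summary above says the fix increases the number of significant digits extracted for the "general" format; the sketch only shows why too few significant digits make identifiers such as card or phone numbers unrecoverable.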