[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710 ]
Tilman Hausherr edited comment on TIKA-3544 at 9/8/21, 6:15 AM:
----------------------------------------------------------------
It seems to depend on the value (this output was produced with 2.1.1):
{noformat}
<?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="extended-properties:AppVersion" content="16.0300"/>
<meta name="protected" content="false"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="meta:last-author" content="Jitin Jindal"/>
<meta name="X-TIKA:digest:SHA256"
content="7d1109045508e7fdc0148d9e9e7b16d01ce18ae0794f7381145e23973996c0b6"/>
<meta name="extended-properties:DocSecurityString" content="None"/>
<meta name="resourceName" content="Credit Card Numbers.xlsx"/>
<meta name="dcterms:modified" content="2021-09-07T20:57:34Z"/>
<meta name="Content-Length" content="500481"/>
<meta name="X-TIKA:digest:MD5" content="72c4c6777f1f9144542ddf5a059d2ffa"/>
<meta name="Content-Type"
content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<title/>
</head>
<body><div><h1>Payments - Payment Details</h1>
<table><tbody><tr> <td>Payment Details</td></tr>
<tr> <td>Credit Card Numbers (Source:
http://www.getcreditcardnumbers.com/)</td></tr>
<tr> <td>6,48019534464278E+15</td></tr>
<tr> <td>30295201231669</td></tr>
<tr> <td>30082494556063</td></tr>
<tr> <td>344850003945824</td></tr>
<tr> <td>3,58338792333363E+15</td></tr>
<tr> <td>3,58738537059364E+15</td></tr>
<tr/>
</tbody></table>
<p>&"Helvetica,Regular"&12&K000000&P </p>
<a
href="http://www.getcreditcardnumbers.com/">http://www.getcreditcardnumbers.com/</a></div>
</body></html>
{noformat}
> Extraction of long sequences of digits from Excel spreadsheets using Tika
> 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-3544
> URL: https://issues.apache.org/jira/browse/TIKA-3544
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.20
> Reporter: Jitin Jindal
> Priority: Major
> Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.20 emits that sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the
> attached spreadsheet as 6.480195344642784E15, which clearly is not the
> desired output.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers to name a few.
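For illustration, the reported string 6.480195344642784E15 is what plain Java produces when a double of that magnitude is converted with Double.toString. The sketch below assumes the cell value reaches the caller as a double; the class PlainDigits and the helper toPlain are hypothetical names, not part of Tika or POI, and the numeric literal is read off the reported output rather than taken from the spreadsheet. It is a minimal sketch of one way to print such a value without an exponent, not a description of how Tika formats cells.

```java
import java.math.BigDecimal;

public class PlainDigits {

    // Hypothetical helper (not a Tika or POI API): renders a double
    // without exponent notation. BigDecimal.valueOf goes through
    // Double.toString, and toPlainString() then expands the exponent.
    static String toPlain(double value) {
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        // 16 digits, below 2^53, so the double holds this integer exactly.
        double cell = 6480195344642784d;

        System.out.println(Double.toString(cell)); // prints 6.480195344642784E15
        System.out.println(toPlain(cell));         // prints 6480195344642784
    }
}
```

This only recovers the digits when the number still fits a double exactly. Integers above 2^53 (about 9.007E15) lose low-order digits before any formatting happens, so identifiers of that length would need to be stored as text in the spreadsheet.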