David Smiley created TIKA-2722:
----------------------------------
Summary: Don't call Date.toString (Possible issue with JDK 11)
Key: TIKA-2722
URL: https://issues.apache.org/jira/browse/TIKA-2722
Project: Tika
Issue Type: Bug
Environment: Tika 1.18, JDK 11 with locale set to "ar-EG".
Reporter: David Smiley
I'm troubleshooting [a test failure in Apache
Lucene/Solr|https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22799/]'s
"extracting" contrib that occurs in JDK 11 with locale "ar-EG". JDK 8 & 9
pass; I don't know about JDK 10. It has to do with extracting date metadata
from a PDF, particularly the created date but perhaps others too.
I stepped through the code into Tika and I think I've found out where the
troublesome code is. First note PDFParser line 271: {{addMetadata(metadata,
"created", info.getCreationDate());}}. That addMetadata overload variant will
call toString on a Date. IMO that's asking for trouble since its output is
Locale-dependent. I think that's okay to show to a user but not for
machine-to-machine information exchange. In the case of the test, it yielded
this odd-looking date string:
Thu Nov 13 18:35:51 GMT+٠٥:٠٠ 2008
I pasted that in and it looks consistent with what I see in IntelliJ and in
Jenkins logs; hopefully it will post correctly to JIRA, though I won't be
certain until after I click "Create". The odd part is the hour & minutes
relative to GMT, which come out in Arabic-Indic digits.
Perhaps this problem is also indicative of a JDK 11 bug? Nevertheless I think
Tika should avoid calling Date.toString().
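As a rough sketch of what a locale-independent alternative could look like (this is not Tika's code; the class name IsoDateExample and the helper toIso8601 are hypothetical, made up for illustration), the Date can be converted to an Instant and formatted with the JDK's built-in ISO-8601 formatter, which always emits ASCII digits in UTC:

```java
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;

public class IsoDateExample {

    // Hypothetical helper, not part of Tika: formats a Date as an
    // ISO-8601 instant in UTC instead of relying on Date.toString().
    public static String toIso8601(Date date) {
        // ISO_INSTANT does not consult the default locale or time zone.
        return DateTimeFormatter.ISO_INSTANT.format(date.toInstant());
    }

    public static void main(String[] args) {
        // Assumption: mimic the failing environment by switching the
        // JVM default locale to ar-EG.
        Locale.setDefault(Locale.forLanguageTag("ar-EG"));

        // Same moment as the date string quoted above
        // (18:35:51 at GMT+05:00 on 13 Nov 2008), as epoch milliseconds.
        Date created = new Date(1226583351000L);

        // Locale-sensitive path that this issue suggests avoiding.
        System.out.println(created.toString());

        // Locale-independent path: prints 2008-11-13T13:35:51Z
        System.out.println(toIso8601(created));
    }
}
```

Whether Tika would want an ISO-8601 instant or some other fixed format here is a separate question; the point is only that the string should not vary with the JVM's default locale.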