[
https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603979#comment-16603979
]
Nick Burch commented on TIKA-2722:
----------------------------------
Currently, Tika stores all metadata internally as Strings. For typed
properties, the getters and setters convert between the native types and their
string form, so you can, for example, get a {{Date}} back if you want one. (This
also lets you read all metadata as strings regardless of type. Other storage
approaches have been suggested, but none has won the argument for a change just yet!)
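As a rough illustration of the string-backed design described above, here is a minimal sketch using only the JDK. The class {{StringBackedMetadata}} and its methods are hypothetical stand-ins, not Tika's actual {{Metadata}} class, and the ISO 8601 storage format is an assumption chosen for the example.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a string-backed metadata store (not Tika's Metadata class).
final class StringBackedMetadata {
    // Everything is stored as a String, whatever its logical type.
    private final Map<String, String> values = new LinkedHashMap<>();

    // Untyped access: any value can be written and read as a plain string.
    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    // Typed setter: converts the Date to a locale-independent ISO 8601 string.
    // Date.toString() is deliberately not used here.
    void set(String name, Date date) {
        values.put(name, date.toInstant().toString());
    }

    // Typed getter: parses the stored string back into a Date.
    // Returns null if the value is missing or is not an ISO 8601 instant.
    Date getDate(String name) {
        String raw = values.get(name);
        if (raw == null) {
            return null;
        }
        try {
            return Date.from(Instant.parse(raw));
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

With a store like this, a caller who only wants strings can iterate over every value without knowing its type, while a caller who wants a {{Date}} goes through the typed getter.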
For {{Date}} properties, there's a bunch of logic in Tika that tries to take
care of the formatting, thread safety etc. See
{{org.apache.tika.utils.DateUtils.formatDate}} for the full details. That
should all be going via {{String.format(Locale.ROOT, ....)}} to avoid any locale issues.
For PDFs specifically, for the well-known typed Date properties, we ought to be
getting a {{Calendar}} back from PDFBox, then getting a {{Date}} object from
that to set on the {{Metadata}} object, which then internally formats, no
{{toString}} calls. If you've found a case where that route isn't being
followed, a small PDF and possibly a unit test to show it would be great, so we
can fix that!
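To make the intended route concrete ({{Calendar}} to {{Date}} to a string formatted with {{Locale.ROOT}}), here is a small JDK-only sketch. {{RootLocaleDateFormat.formatDate}} is a hypothetical helper written for this comment, not the code in {{DateUtils}}, and the UTC ISO 8601 output pattern is an assumption.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not Tika's DateUtils: formats a Date without Date.toString().
final class RootLocaleDateFormat {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    private RootLocaleDateFormat() {
    }

    static String formatDate(Date date) {
        // A fresh Calendar per call means no mutable state is shared between threads.
        Calendar calendar = Calendar.getInstance(UTC, Locale.ROOT);
        calendar.setTime(date);
        // Locale.ROOT keeps the digits ASCII whatever the JVM default locale is.
        return String.format(Locale.ROOT, "%04d-%02d-%02dT%02d:%02d:%02dZ",
                calendar.get(Calendar.YEAR),
                calendar.get(Calendar.MONTH) + 1, // Calendar months are zero-based
                calendar.get(Calendar.DAY_OF_MONTH),
                calendar.get(Calendar.HOUR_OF_DAY),
                calendar.get(Calendar.MINUTE),
                calendar.get(Calendar.SECOND));
    }
}
```

A caller holding a {{Calendar}} would pass {{calendar.getTime()}} to the helper, so the resulting string does not depend on the default locale or on the calendar's own time zone.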
> Don't call Date.toString (Possible issue with JDK 11)
> -----------------------------------------------------
>
> Key: TIKA-2722
> URL: https://issues.apache.org/jira/browse/TIKA-2722
> Project: Tika
> Issue Type: Bug
> Environment: Tika 1.18, JDK 11 with locale set to "ar-EG".
> Reporter: David Smiley
> Priority: Major
>
> I'm troubleshooting [a test failure in Apache
> Lucene/Solr's|https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22799/]
> "extracting" contrib that occurs in JDK 11 with locale "ar-EG". JDK 8 & 9
> pass; I don't know about JDK 10. It has to do with extracting date metadata
> from a PDF, particularly the created date but perhaps others too.
> I stepped through the code into Tika and I think I've found out where the
> troublesome code is. First note PDFParser line 271: {{addMetadata(metadata,
> "created", info.getCreationDate());}}. That addMetadata overload variant
> will call toString on a Date. IMO that's asking for trouble since the output
> of that is Locale-dependent. I think that's okay to show to a user but not
> for machine-to-machine information exchange. In the case of the test, it
> yielded this odd looking date string:
> Thu Nov 13 18:35:51 GMT+٠٥:٠٠ 2008
> I pasted that in and it looks consistent with what I see in IntelliJ and in
> Jenkins logs; hopefully will post correctly to JIRA. The odd part is the
> hour & minutes relative to GMT. I won't be certain until after I click
> "Create".
> Perhaps this problem is also indicative of a JDK 11 bug? Nevertheless I
> think Tika should avoid calling Date.toString().