[ https://issues.apache.org/jira/browse/TIKA-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542418#comment-15542418 ]
Tim Allison commented on TIKA-2108:
-----------------------------------
Thank you for opening this. Is it safe to split on commas and semicolons for
every value? This gives me pause.
Should we limit the splitting to metadata fields with cardinality {{0..>1}} in
the
[spec|http://www.iptc.org/std/photometadata/specification/IPTC-PhotoMetadata],
e.g. keywords, phone numbers, email addresses, URLs, etc.?
[~rgauss], any recommendations?
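
To make the question concrete, here is a minimal sketch of splitting only fields on an allow-list, written as Python post-processing rather than as Tika code. The allow-list contents, the {{split_value}} helper, and the "Keywords" and "Caption" field names are assumptions for illustration; only "Supplemental Category(s)" comes from the report below.

```python
import re

# Hypothetical allow-list of fields treated as multi-valued (assumption:
# in practice this would be derived from the cardinality in the IPTC spec).
MULTI_VALUED_FIELDS = {"Supplemental Category(s)", "Keywords"}

# A comma or semicolon, with any surrounding whitespace, separates values.
_SEPARATORS = re.compile(r"\s*[,;]\s*")


def split_value(field, raw):
    """Return the values of a field as a list.

    Only fields on the allow-list are split; every other field is
    returned unchanged as a single-element list.
    """
    if field not in MULTI_VALUED_FIELDS:
        return [raw]
    # Drop empty strings left behind by leading or trailing separators.
    return [part for part in _SEPARATORS.split(raw.strip()) if part]


if __name__ == "__main__":
    print(split_value("Supplemental Category(s)",
                      "Fribourg, Cathédrale, Pont de la Poya"))
    print(split_value("Caption", "Bridge, seen from the old town"))
```

With this approach a free-text field that happens to contain a comma stays intact, while the categories keep their internal spaces ("Pont de la Poya" remains one value).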
> Non-semantic extraction of supplemental categories
> --------------------------------------------------
>
> Key: TIKA-2108
> URL: https://issues.apache.org/jira/browse/TIKA-2108
> Project: Tika
> Issue Type: Bug
> Environment: tika-app-1.13.jar, tika-server-1.13.jar
> Reporter: Xavier Perseguers
> Priority: Minor
>
> When extracting metadata from a file with categories, the comma-separated
> (or, according to the IPTC specification, semicolon-separated) list of
> categories is extracted as a space-separated list of terms.
> Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg
> {code}
> java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
> <meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la Poya"/>
> {code}
> When using exiftool:
> {code}
> exiftool sample.jpg | grep Categories
> Supplemental Categories : Fribourg, Cathédrale, Pont de la Poya
> {code}
> This is not a problem when using Tika with, say, Solr since stopwords such as
> "de" and "la" will be dropped. However, when using Tika standalone with an
> external tool, there is no way to fetch the *actual* list of categories,
> namely:
> * Fribourg
> * Cathédrale
> * Pont de la Poya
> where spaces inside a category name should be preserved rather than
> splitting the name into single-term topics.