[ https://issues.apache.org/jira/browse/TIKA-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542427#comment-15542427 ]
Xavier Perseguers commented on TIKA-2108:
-----------------------------------------
From what I see in exiftool and in the IPTC editor in Adobe Photoshop, yes:
categories may be separated by either a comma or a semicolon, so those two
characters are real delimiters and cannot be used as part of a category name.
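
To illustrate what that means for a consumer of the raw value, here is a
minimal shell sketch that splits a category string on either delimiter and
trims the surrounding whitespace. The function name split_categories is
hypothetical, and the sample string is a variation of the exiftool output
quoted in the issue, with one semicolon swapped in so that both delimiters
are exercised. It is only a sketch of the splitting rule, not what Tika or
exiftool do internally.

```shell
#!/bin/sh
# split_categories (hypothetical helper): print one category per line.
# Assumes "," and ";" are the only delimiters and never occur inside a name.
split_categories() {
  printf '%s\n' "$1" \
    | tr ',;' '\n\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Sample value adapted from the exiftool output in the issue description.
split_categories "Fribourg, Cathédrale; Pont de la Poya"
```

Spaces inside a name such as "Pont de la Poya" are preserved, because only
the two delimiter characters start a new entry.
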
> Non-semantic extraction of supplemental categories
> --------------------------------------------------
>
> Key: TIKA-2108
> URL: https://issues.apache.org/jira/browse/TIKA-2108
> Project: Tika
> Issue Type: Bug
> Environment: tika-app-1.13.jar, tika-server-1.13.jar
> Reporter: Xavier Perseguers
> Priority: Minor
>
> When extracting metadata from a file with categories, the comma-separated (or,
> according to the IPTC specification, semicolon-separated) list of categories
> is extracted as a space-separated list of terms.
> Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg
> {code}
> java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
> <meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la Poya"/>
> {code}
> When using exiftool:
> {code}
> exiftool sample.jpg | grep Categories
> Supplemental Categories : Fribourg, Cathédrale, Pont de la Poya
> {code}
> This is not a problem when using Tika with, say, Solr, since stopwords such as
> "de" and "la" will be dropped. However, when using Tika standalone with an
> external tool, there is no way to recover the *actual* list of categories,
> namely:
> * Fribourg
> * Cathédrale
> * Pont de la Poya
> where the blank spaces within a category name are intentional and should be
> preserved, rather than splitting the name into single-term topic names.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)