[ 
https://issues.apache.org/jira/browse/TIKA-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542427#comment-15542427
 ] 

Xavier Perseguers commented on TIKA-2108:
-----------------------------------------

Judging from exiftool and from the IPTC editor in Adobe Photoshop, yes: those
categories may be separated by either a comma or a semicolon. Both characters
are therefore real delimiters and cannot be used as part of a category name.
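
To illustrate the delimiter rule described above, here is a minimal sketch of
splitting a raw category string on either a comma or a semicolon. The class
and method names (SupplementalCategories, split) are hypothetical and are not
part of Tika's API. The sketch also assumes the input still contains its
delimiters, as in the exiftool output; it cannot recover the categories from
the space-joined value that tika-app-1.13 currently emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
public class SupplementalCategories {

    // Splits a raw value on ',' or ';' and trims each resulting name.
    // Blank spaces inside a name (e.g. "Pont de la Poya") are preserved.
    static List<String> split(String raw) {
        List<String> names = new ArrayList<>();
        if (raw == null) {
            return names;
        }
        for (String part : raw.split("[,;]")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample value taken from the exiftool output in the issue.
        for (String name : split("Fribourg, Cathédrale, Pont de la Poya")) {
            System.out.println(name);
        }
    }
}
```

Because both characters are treated as delimiters, a mixed value such as
"Fribourg; Cathédrale, Pont de la Poya" yields the same three names.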

> Non-semantic extraction of supplemental categories
> --------------------------------------------------
>
>                 Key: TIKA-2108
>                 URL: https://issues.apache.org/jira/browse/TIKA-2108
>             Project: Tika
>          Issue Type: Bug
>         Environment: tika-app-1.13.jar, tika-server-1.13.jar
>            Reporter: Xavier Perseguers
>            Priority: Minor
>
> When extracting metadata from a file with categories, the comma-separated
> (or semicolon-separated, according to the IPTC specification) list of
> categories is extracted as a space-separated list of terms.
> Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg
> {code}
> java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
> <meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la Poya"/>
> {code}
> When using exiftool:
> {code}
> exiftool sample.jpg  | grep Categories
> Supplemental Categories         : Fribourg, Cathédrale, Pont de la Poya
> {code}
> This is not a problem when using Tika with, say, Solr, since stopwords such
> as "de" and "la" will be dropped. However, when using Tika standalone with
> an external tool, there is no way to recover the *actual* list of
> categories, namely:
> * Fribourg
> * Cathédrale
> * Pont de la Poya
> Here the blank spaces within a category name may be intentional, whereas
> the current output only allows single-term topic names to be recovered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
