[jira] [Commented] (TIKA-2108) Non-semantic extraction of supplemental categories

Tim Allison (JIRA) Mon, 03 Oct 2016 06:27:16 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542418#comment-15542418
 ]


Tim Allison commented on TIKA-2108:
-----------------------------------

Thank you for opening this.  Is it safe to split on comma/semicolon for every 
value?  This gives me pause.

Should we limit splitting to metadata values with cardinality {{0..>1}} in the 
[spec|http://www.iptc.org/std/photometadata/specification/IPTC-PhotoMetadata], 
e.g. keywords, phone numbers, email addresses, URLs, etc.?
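
A minimal sketch of that restricted approach, written in Python for brevity 
rather than as Tika code. It assumes the raw value arrives as one 
delimiter-joined string. The field whitelist and the {{split_metadata_value}} 
helper are hypothetical names for illustration; they are neither Tika property 
names nor a complete reading of the IPTC spec.

```python
import re

# Hypothetical whitelist of fields treated as multi-valued. Illustrative only:
# these are not Tika property names, nor an exhaustive list from the IPTC spec.
MULTI_VALUED_FIELDS = {"Keywords", "Supplemental Category(s)"}

# Split on either a comma or a semicolon.
_DELIMITERS = re.compile(r"[,;]")


def split_metadata_value(field, raw):
    """Return a list of values, splitting only fields known to be multi-valued."""
    if field not in MULTI_VALUED_FIELDS:
        # Single-valued field: a comma here is ordinary punctuation, keep it.
        return [raw]
    parts = (part.strip() for part in _DELIMITERS.split(raw))
    # Drop empty entries produced by doubled or trailing delimiters.
    return [part for part in parts if part]


categories = split_metadata_value(
    "Supplemental Category(s)", "Fribourg, Cathédrale, Pont de la Poya"
)
headline = split_metadata_value("Headline", "Fribourg, seen from the bridge")
```

With this shape, a comma inside a single-valued field such as a headline is 
left untouched, while whitelisted fields keep multi-word entries like 
"Pont de la Poya" intact.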

[~rgauss], any recommendations?

> Non-semantic extraction of supplemental categories
> --------------------------------------------------
>
>                 Key: TIKA-2108
>                 URL: https://issues.apache.org/jira/browse/TIKA-2108
>             Project: Tika
>          Issue Type: Bug
>         Environment: tika-app-1.13.jar, tika-server-1.13.jar
>            Reporter: Xavier Perseguers
>            Priority: Minor
>
> When extracting metadata for a file with categories, the comma-separated (or, 
> according to the IPTC specification, semicolon-separated) list of categories 
> is extracted as a space-separated list of terms.
> Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg
> {code}
> java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
> <meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la 
> Poya"/>
> {code}
> When using exiftool:
> {code}
> exiftool sample.jpg  | grep Categories
> Supplemental Categories         : Fribourg, Cathédrale, Pont de la Poya
> {code}
> This is not a problem when using Tika with, say, Solr since stopwords such as 
> "de" and "la" will be dropped. However, when using Tika standalone with an 
> external tool, there is no way to fetch the *actual* list of categories, 
> namely:
> * Fribourg
> * Cathédrale
> * Pont de la Poya
> Spaces within a category name (as in "Pont de la Poya") are significant, so 
> the list should not be flattened into single-term topic names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
