[jira] [Updated] (TIKA-2108) Non-semantic extraction of supplemental categories

Xavier Perseguers (JIRA) Mon, 03 Oct 2016 00:23:59 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xavier Perseguers updated TIKA-2108:
------------------------------------
    Description: 
When extracting metadata for a file with categories, the comma-separated (or 
semicolon-separated, according to the IPTC specification) list of categories is 
extracted as a single space-separated list of terms.

Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg

{code}
java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
<meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la Poya"/>
{code}

When using exiftool:

{code}
exiftool sample.jpg  | grep Categories
Supplemental Categories         : Fribourg, Cathédrale, Pont de la Poya
{code}

This is not a problem when using Tika with, say, Solr since stopwords such as 
"de" and "la" will be dropped. However, when using Tika standalone with an 
external tool, there is no way to fetch the *actual* list of categories, namely:

* Fribourg
* Cathédrale
* Pont de la Poya

where blank spaces within a category name (as in "Pont de la Poya") may be 
intentional and should be preserved, rather than yielding single-term topic names.
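
To make the loss of information concrete, here is a minimal sketch of what an external tool faces. It uses only the two output values quoted above. The function names are hypothetical and are not part of Tika or exiftool. It also assumes that category names never contain a comma or semicolon themselves.

```python
import re

# Values copied verbatim from the two tool outputs quoted in this report.
tika_value = "Fribourg Cathédrale Pont de la Poya"
exiftool_value = "Fribourg, Cathédrale, Pont de la Poya"


def split_on_whitespace(value):
    """Hypothetical helper: the only split possible on the Tika value."""
    return value.split()


def split_on_delimiter(value):
    """Hypothetical helper: split on comma or semicolon, trim each part.

    Assumes category names never contain a comma or semicolon.
    """
    return [part.strip() for part in re.split(r"[,;]", value) if part.strip()]


print(split_on_whitespace(tika_value))
# ['Fribourg', 'Cathédrale', 'Pont', 'de', 'la', 'Poya']
print(split_on_delimiter(exiftool_value))
# ['Fribourg', 'Cathédrale', 'Pont de la Poya']
```

With the space-joined value there is no way to tell that "Pont de la Poya" was one category, whereas the delimiter-separated value yields the three expected entries.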

  was:
When extracting metadata for a file with categories, the comma (or semi-colon - 
according to the IPTC specification) -separated list of categories is extracted 
as a blank space-separated list of terms.

Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg

```
java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
<meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la Poya"/>
```

When using exiftool:

```
exiftool sample.jpg  | grep Categories
Supplemental Categories         : Fribourg, Cathédrale, Pont de la Poya
```

This is not a problem when using Tika with, say, Solr since stopwords such as 
"de" and "la" will be dropped. However, when using Tika standalone with an 
external tool, there is no way to fetch the *actual* list of categories, namely:

* Fribourg
* Cathédrale
* Pont de la Poya

where blank spaces in the title may be wanted instead of getting single-term 
topic names.


> Non-semantic extraction of supplemental categories
> --------------------------------------------------
>
>                 Key: TIKA-2108
>                 URL: https://issues.apache.org/jira/browse/TIKA-2108
>             Project: Tika
>          Issue Type: Bug
>         Environment: tika-app-1.13.jar, tika-server-1.13.jar
>            Reporter: Xavier Perseguers
>            Priority: Minor
>
> When extracting metadata for a file with categories, the comma-separated (or 
> semicolon-separated, according to the IPTC specification) list of categories 
> is extracted as a single space-separated list of terms.
> Example with https://dl.dropboxusercontent.com/u/3177102/sample.jpg
> {code}
> java -jar tika-app-1.13.jar --xml sample.jpg | grep Category
> <meta name="Supplemental Category(s)" content="Fribourg Cathédrale Pont de la Poya"/>
> {code}
> When using exiftool:
> {code}
> exiftool sample.jpg  | grep Categories
> Supplemental Categories         : Fribourg, Cathédrale, Pont de la Poya
> {code}
> This is not a problem when using Tika with, say, Solr since stopwords such as 
> "de" and "la" will be dropped. However, when using Tika standalone with an 
> external tool, there is no way to fetch the *actual* list of categories, 
> namely:
> * Fribourg
> * Cathédrale
> * Pont de la Poya
> where blank spaces within a category name (as in "Pont de la Poya") may be 
> intentional and should be preserved, rather than yielding single-term topic 
> names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
