Rupert Westenthaler created STANBOL-809:
-------------------------------------------

             Summary: Pass ContentItem URI to the Tika content type detector
                 Key: STANBOL-809
                 URL: https://issues.apache.org/jira/browse/STANBOL-809
             Project: Stanbol
          Issue Type: Bug
          Components: Engine - Tika
            Reporter: Rupert Westenthaler
            Priority: Minor


Content type detection could be improved by using the URI of the processed
content item: the Tika API allows the file name (or URI) of a resource to be
passed explicitly as an input parameter to content type detection (see
https://tika.apache.org/1.2/detection.html#Resource_Name_Based_Detection).

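    // Pass the content item URI to Tika as a resource name hint
    Metadata m = new Metadata();
    m.add(Metadata.RESOURCE_NAME_KEY,
        contentItem.getUri().getUnicodeString());
    // "is" is the InputStream of the content item
    MediaType type = detector.detect(is, m);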

This would mean that file name pattern based recognition works when the
content item URI is set manually in the request to the Stanbol Enhancer, e.g.

     curl -X POST -H "Accept: text/turtle" -T test.docx \
         "http://dev.iks-project.eu:8080/enhancer/engine/tika?id=http://www.example.com/test.docx"
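
To illustrate why the name hint matters, here is a minimal, self-contained
sketch in plain Java. It is not Tika's implementation: the class
NameHintSketch, its methods and the two-entry extension table are
hypothetical, and the rule "prefer the name hint when there is one" is a
simplification of how Tika combines content-based and name-based hints. It
only shows that the first bytes of a .docx file identify it as a ZIP archive,
while the last path segment of the content item URI can narrow that down.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy illustration only; this is NOT how Tika is implemented. */
public class NameHintSketch {

    // Hypothetical extension table (assumption); Tika ships a far larger set of patterns.
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".docx",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        BY_EXTENSION.put(".pdf", "application/pdf");
    }

    /** Content-only guess: the first bytes of a .docx only reveal a ZIP signature ("PK"). */
    public static String detectByMagic(byte[] head) {
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** Name-only guess from the last path segment of the URI; null if there is no match. */
    public static String detectByName(String uri) {
        if (uri == null) {
            return null;
        }
        String path = URI.create(uri).getPath(); // null for opaque URIs such as urn:...
        if (path == null) {
            return null;
        }
        String name = path.substring(path.lastIndexOf('/') + 1).toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : BY_EXTENSION.entrySet()) {
            if (name.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    /** Simplified combination: use the name hint if present, else the content-based guess. */
    public static String detect(byte[] head, String uri) {
        String byName = detectByName(uri);
        return byName != null ? byName : detectByMagic(head);
    }

    public static void main(String[] args) {
        byte[] zipHead = {'P', 'K', 3, 4};
        // Without a URI only the container format is recognised.
        System.out.println(detect(zipHead, null)); // prints application/zip
        // With the URI from the curl example the Word type is recognised.
        System.out.println(detect(zipHead, "http://www.example.com/test.docx"));
    }
}
```

In this sketch a URI without a path (for example a urn:) or without a known
extension yields no name hint, so detection falls back to the content-based
guess, which corresponds to the behaviour without the proposed change.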

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
