[ 
https://issues.apache.org/jira/browse/CONNECTORS-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172266#comment-14172266
 ] 

Karl Wright edited comment on CONNECTORS-1074 at 10/15/14 11:44 AM:
--------------------------------------------------------------------

Hi Abe-san,

Replacing ExtensionMimeMap wherever it is used with a Tika.detect(String 
filename) call, which is what you are proposing, would require all the Tika 
jars and their dependencies to be included in every war file, rather than just 
once (in connector-lib).  This is because the classes would need to be 
accessible to the root class loader.  It may be possible to include only one 
Tika jar and still have the detect(String filename) method work, but you would 
need to experiment to find out.

In the Tika connector itself, the RepositoryDocument.setMimeType() method is 
supposed to describe the binary stream that you get from 
RepositoryDocument.getBinaryStream().  Since the output of Tika is always 
characters, which the Tika transformer converts to utf-8 bytes, the content 
type should always be "text/plain;charset=utf-8".

If you want to modify the Tika connector to report what the *original* mime 
type was in some other metadata field, that is fine with me, but you should not 
call setMimeType() because it will break things.
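
As a rough illustration of the contract described above, here is a minimal 
Java sketch.  It does not use the real ManifoldCF or Tika classes: 
SketchDocument is a hypothetical stand-in for RepositoryDocument, and the 
field name "originalMimeType" is an assumption made up for this example.  The 
mime type describes the UTF-8 bytes in the binary stream, and the original 
type is carried in a separate metadata field.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class TikaOutputSketch {
    // Assumed metadata field name, chosen for illustration only.
    static final String ORIGINAL_MIME_FIELD = "originalMimeType";

    // Hypothetical stand-in for RepositoryDocument; NOT the real ManifoldCF class.
    static class SketchDocument {
        private String mimeType;
        private byte[] content = new byte[0];
        private final Map<String, String> fields = new HashMap<>();

        void setMimeType(String mimeType) { this.mimeType = mimeType; }
        String getMimeType() { return mimeType; }
        void setBinary(byte[] content) { this.content = content; }
        InputStream getBinaryStream() { return new ByteArrayInputStream(content); }
        void addField(String name, String value) { fields.put(name, value); }
        String getField(String name) { return fields.get(name); }
    }

    // Builds the document a text-extracting transformer would hand downstream.
    static SketchDocument toOutputDocument(String extractedText, String originalMimeType) {
        SketchDocument doc = new SketchDocument();
        // The stream now holds extracted characters encoded as UTF-8 bytes...
        doc.setBinary(extractedText.getBytes(StandardCharsets.UTF_8));
        // ...so the mime type must describe that stream, not the source file.
        doc.setMimeType("text/plain;charset=utf-8");
        // The original type travels separately, as ordinary metadata.
        if (originalMimeType != null) {
            doc.addField(ORIGINAL_MIME_FIELD, originalMimeType);
        }
        return doc;
    }

    public static void main(String[] args) {
        SketchDocument doc = toOutputDocument("extracted text", "application/pdf");
        System.out.println(doc.getMimeType());
        System.out.println(doc.getField(ORIGINAL_MIME_FIELD));
    }
}
```

With this shape, downstream consumers that trust the mime type still see 
plain UTF-8 text, while anything that cares about the source format can read 
the extra field.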



was (Author: [email protected]):
Hi Abe-san,

the RepositoryDocument.setMimeType() method is supposed to describe the binary 
stream that you get from RepositoryDocument.getBinaryStream().  Since the 
output of Tika is always characters, which the Tika transformer converts to 
utf-8 bytes, the content type should always be "text/plain;charset=utf-8".

If you want to modify the Tika connector to report what the *original* mime 
type was in some other metadata field, that is fine with me, but you should not 
call setMimeType() because it will break things.


> Replace ExtensionMimeMap with new Tika().detect(filename)
> ---------------------------------------------------------
>
>                 Key: CONNECTORS-1074
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1074
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework core
>            Reporter: Shinichiro Abe
>             Fix For: ManifoldCF 2.0
>
>
> It would be nice if we could support many mime types, since ManifoldCF is 
> already using Tika.
> {noformat}
>  new Tika().detect(fileName);
> {noformat}
> returns the mime type as a String. We could then pass this to 
> RepositoryDocument#setMimeType(mimeType) in each connector.
> Tika reference:
> [javadoc|http://tika.apache.org/1.6/api/org/apache/tika/Tika.html]
> [test 
> code|http://svn.apache.org/viewvc/tika/tags/1.6-rc2/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java?view=markup]



