[ 
https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486904#comment-16486904
 ] 

Sebastian Nagel commented on TIKA-2648:
---------------------------------------

Hi Gerard,
thanks for bringing this up. It actually affects several more MIME types related 
to server-side scripting and web application formats, see the [discussion on 
user@tika|https://lists.apache.org/thread.html/1e4f4b6c249618a446f2e92f56ef90e6bfa0dfe51ce10197461df3d9@%3Cuser.tika.apache.org%3E]
 (sorry, I've never found the time to dig into it).
The file extension {{.php}} is an indicator for the MIME type {{text/x-php}} 
only on a (local) file system. For a URL it gives practically no clue: the 
content is most likely HTML, but the server may also use PHP to render PDF or to 
deliver images. Some open questions and options:
* Is there a way to model this distinction between URL and file glob patterns 
in tika-mimetypes.xml?
* Could the [probabilistic 
detector|https://wiki.apache.org/tika/BaysianMimeTypeSelector] help? We could 
give more weight to the Content-Type sent by the server. Of course, this may 
affect the detection of {{.pdf}}, {{.docx}}, etc.
* Only as a last resort: modify tika-mimetypes.xml so that it matches the needs 
of a crawler and maintain it separately (as part of Nutch).
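
To illustrate the URL versus file distinction, here is a minimal sketch that uses only the Java standard library. It is not Tika or Nutch code: the class {{UrlAwareTypeHint}}, its methods, and the list of server-side extensions are all hypothetical and chosen for illustration. The idea is that a caller could ask this helper whether the server's Content-Type should be preferred over the extension-based guess before consulting the name-based detector.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Tika or Nutch). */
final class UrlAwareTypeHint {

    // Assumption: an illustrative, incomplete list of server-side script extensions.
    private static final Set<String> SERVER_SIDE_EXTENSIONS =
            Set.of("php", "asp", "aspx", "jsp", "cgi");

    /** True if the resource name looks like an HTTP(S) URL rather than a file name. */
    static boolean isUrl(String resourceName) {
        String lower = resourceName.toLowerCase(Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    /** Lower-cased extension of the last path segment, or "" if there is none. */
    static String extension(String resourceName) {
        String path = resourceName;
        if (isUrl(resourceName)) {
            try {
                // getPath() drops scheme, host, query and fragment.
                path = URI.create(resourceName).getPath();
            } catch (IllegalArgumentException e) {
                path = null; // unparsable URL: treat as having no extension
            }
            if (path == null) {
                return "";
            }
        }
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * True if the Content-Type hint should win over the extension: the name is
     * a URL, its extension is a server-side script type, and a hint is present.
     */
    static boolean preferContentTypeHint(String resourceName, String contentTypeHint) {
        return contentTypeHint != null
                && !contentTypeHint.isEmpty()
                && isUrl(resourceName)
                && SERVER_SIDE_EXTENSIONS.contains(extension(resourceName));
    }

    public static void main(String[] args) {
        // URL with a server-side extension: trust the server's Content-Type.
        System.out.println(preferContentTypeHint(
                "https://www.facebook.com/home.php", "text/html")); // true
        // Local file: the extension remains a meaningful indicator.
        System.out.println(preferContentTypeHint(
                "/var/www/home.php", "text/html")); // false
        // URL with a document extension: detection of .pdf is left unchanged.
        System.out.println(preferContentTypeHint(
                "https://example.com/report.pdf", "text/html")); // false
    }
}
```

This only decides which hint to trust. It leaves {{.pdf}}, {{.docx}}, etc. untouched because those extensions are not in the (assumed) server-side list.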

Regarding your example: if the content is passed to Tika, the MIME type is 
correctly recognized. Of course, that's no excuse, and content-based detection 
often fails because the content is only an HTML fragment, not well-formed, or 
just not easy to detect.

> mime detection based on resource name detects resources as "text/x-php" 
> instead of "text/html" 
> -----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2648
>                 URL: https://issues.apache.org/jira/browse/TIKA-2648
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> When using Tika to detect a MIME type given only a URL containing ".php" and 
> a content-type hint of "text/html", it guesses "text/x-php", whereas one 
> would expect "text/html".
> {code}
> TikaConfig tika = new TikaConfig();
> Metadata metadata = new Metadata();
> String url = "https://www.facebook.com/home.php";
> metadata.set(Metadata.RESOURCE_NAME_KEY, url);
> metadata.set(Metadata.CONTENT_TYPE, "text/html");
> MediaType type = tika.getDetector().detect(null, metadata);
> System.out.println(url + " is of type " + type.toString());
> // Prints https://www.facebook.com/home.php is of type text/x-php
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
