
Sebastian Nagel (JIRA) Mon, 09 Jul 2018 01:54:39 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536703#comment-16536703
 ]


Sebastian Nagel commented on TIKA-2648:
---------------------------------------

Yes, but there is a second case: if a web site is mirrored (e.g. using wget), 
the downloaded files are saved with the extension {{.php}} but contain HTML 
(or PDF, or any other MIME type). If you then call Tika on those files, the 
solution does not work. But I agree that [~gbouchar]'s fix is better than 
nothing. The access pattern (file or HTTP) is also a strong hint as to whether 
the file extension can be trusted. My question was rather whether a more 
general solution is possible: give the file extension {{.php}} less weight in 
general and rely on the content itself or, if available, on the Content-Type 
sent in the HTTP header. It could be something similar to the magic priorities. 
In any case, it is better to have a solution now that works for web crawlers.
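
To make the "less weight" idea concrete, here is a minimal sketch. It is 
not Tika code and not a proposal for the actual patch: the class 
{{WeakExtensionSketch}}, the list of "weak" extensions and the small 
name-to-type table are all hypothetical, and the precedence (sniffed content 
first, then the Content-Type hint for weak extensions, then the name) is only 
one possible reading of the suggestion above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not Tika code: combine three hints, trusting some extensions less. */
final class WeakExtensionSketch {

    // Assumption: extensions that usually name the generating script, not the served content.
    static final Set<String> WEAK_EXTENSIONS = Set.of("php", "asp", "aspx", "jsp", "cgi");

    // Tiny illustrative name-to-type table; a real detector would use its own glob registry.
    static final Map<String, String> NAME_TYPES = Map.of(
            "php", "text/x-php",
            "html", "text/html",
            "pdf", "application/pdf");

    /** Lower-cased extension of the last path segment, ignoring query and fragment; "" if none. */
    static String extensionOf(String resourceName) {
        if (resourceName == null) {
            return "";
        }
        String path = resourceName;
        for (char stop : new char[] {'?', '#'}) {
            int cut = path.indexOf(stop);
            if (cut >= 0) {
                path = path.substring(0, cut);
            }
        }
        String segment = path.substring(path.lastIndexOf('/') + 1);
        int dot = segment.lastIndexOf('.');
        return dot < 0 ? "" : segment.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Picks a type from three optional hints. Any argument may be null.
     * sniffedType stands for whatever content-based (magic) detection returned.
     */
    static String detect(String resourceName, String contentTypeHint, String sniffedType) {
        if (sniffedType != null) {
            return sniffedType; // the content itself is the strongest evidence
        }
        String extension = extensionOf(resourceName);
        if (WEAK_EXTENSIONS.contains(extension) && contentTypeHint != null) {
            return contentTypeHint; // weak extension: the HTTP header outranks the name
        }
        String nameType = NAME_TYPES.get(extension);
        if (nameType != null) {
            return nameType; // ordinary extension, or weak extension with no other hint
        }
        return contentTypeHint != null ? contentTypeHint : "application/octet-stream";
    }

    public static void main(String[] args) {
        // Crawler case: URL ends in .php, server declared text/html.
        System.out.println(detect("https://www.facebook.com/home.php", "text/html", null));
        // Mirrored file: no header, but the content was sniffed as PDF.
        System.out.println(detect("download.php", null, "application/pdf"));
        // Mirrored file with no usable content hint: only the name is left.
        System.out.println(detect("index.php", null, null));
    }
}
```

With these assumptions the crawler case yields {{text/html}}. The mirrored-file 
case is only covered when content sniffing returns something, which is why the 
sketch ranks the content above both the header and the name.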

> mime detection based on resource name detects resources as "text/x-php" 
> instead of "text/html" 
> -----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2648
>                 URL: https://issues.apache.org/jira/browse/TIKA-2648
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> When using Tika to detect a MIME type given only a URL containing ".php" and 
> a content-type hint of "text/html", it guesses "text/x-php", whereas one 
> would expect "text/html".
> {code}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
>
> TikaConfig tika = new TikaConfig();
> Metadata metadata = new Metadata();
> String url = "https://www.facebook.com/home.php";
> metadata.set(Metadata.RESOURCE_NAME_KEY, url);
> metadata.set(Metadata.CONTENT_TYPE, "text/html");
> MediaType type = tika.getDetector().detect(null, metadata);
> System.out.println(url + " is of type " + type.toString());
> // Prints https://www.facebook.com/home.php is of type text/x-php
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
