[ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018544#comment-15018544
 ] 

Lewis John McGibbney commented on NUTCH-2158:
---------------------------------------------

Hi [~jnioche], I can reproduce your failing test as described above.
When I try to detect the content type using tika-server trunk (1.12-SNAPSHOT), I
get the following:
{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new/src/plugin/protocol-http/jsp(joshua) $ curl -T basic-http.jsp http://localhost:9998/rmeta
[{"Content-Encoding":"UTF-8","Content-Type":"text/plain; 
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.txt.TXTParser"],"X-TIKA:content":"\u003chtml
 
xmlns\u003d\"http://www.w3.org/1999/xhtml\"\u003e\n\u003chead\u003e\n\u003cmeta 
name\u003d\"X-Parsed-By\" content\u003d\"org.apache.tika.parser.DefaultParser\" 
/\u003e\n\u003cmeta name\u003d\"X-Parsed-By\" 
content\u003d\"org.apache.tika.parser.txt.TXTParser\" /\u003e\n\u003cmeta 
name\u003d\"Content-Encoding\" content\u003d\"UTF-8\" /\u003e\n\u003cmeta 
name\u003d\"Content-Type\" content\u003d\"text/plain; charset\u003dUTF-8\" 
/\u003e\n\u003ctitle\u003e\u003c/title\u003e\n\u003c/head\u003e\n\u003cbody\u003e\u003cp\u003e\u0026lt;%--\n
  Licensed to the Apache Software Foundation (ASF) under one or more\n  
contributor license agreements.  See the NOTICE file distributed with\n  this 
work for additional information regarding copyright ownership.\n  The ASF 
licenses this file to You under the Apache License, Version 2.0\n  (the 
\"License\"); you may not use this file except in compliance with\n  the 
License.  You may obtain a copy of the License at\n  \n  
http://www.apache.org/licenses/LICENSE-2.0\n  \n  Unless required by applicable 
law or agreed to in writing, software\n  distributed under the License is 
distributed on an \"AS IS\" BASIS,\n  WITHOUT WARRANTIES OR CONDITIONS OF ANY 
KIND, either express or implied.\n  See the License for the specific language 
governing permissions and\n  limitations under the 
License.\n--%\u0026gt;\u0026lt;%--\n  Example JSP Page to Test Protocol-Http 
Plugin  \n--%\u0026gt;\u0026lt;%@ page language\u003d\"java\" 
import\u003d\"java.util.*\" 
pageEncoding\u003d\"UTF-8\"%\u0026gt;\u0026lt;%\nString path \u003d 
request.getContextPath();\nString basePath \u003d 
request.getScheme()+\"://\"+request.getServerName()+\":\"+request.getServerPort()+path+\"/\";\n%\u0026gt;\n\n\u0026lt;!DOCTYPE
 HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\u0026gt;\n\u0026lt;html 
xmlns\u003d\"http://www.w3.org/1999/xhtml\"\u0026gt;\n  
\u0026lt;head\u0026gt;\n    \u0026lt;base 
href\u003d\"\u0026lt;%\u003dbasePath%\u0026gt;\"\u0026gt;\n    \n    
\u0026lt;title\u0026gt;HelloWorld\u0026lt;/title\u0026gt;\n    \u0026lt;meta 
http-equiv\u003d\"content-type\" content\u003d\"text/html;charset\u003dutf-8\" 
/\u0026gt;\n    \u0026lt;meta name\u003d\"Language\" content\u003d\"en\" 
/\u0026gt;\n\t\u0026lt;meta http-equiv\u003d\"pragma\" 
content\u003d\"no-cache\"\u0026gt;\n\t\u0026lt;meta 
http-equiv\u003d\"cache-control\" 
content\u003d\"no-cache\"\u0026gt;\n\t\u0026lt;meta http-equiv\u003d\"expires\" 
content\u003d\"0\"\u0026gt;    \n\t\u0026lt;meta http-equiv\u003d\"keywords\" 
content\u003d\"keyword1,keyword2,keyword3\"\u0026gt;\n\t\u0026lt;meta 
http-equiv\u003d\"description\" content\u003d\"This is my 
page\"\u0026gt;\n\t\u0026lt;!--\n\t\u0026lt;link rel\u003d\"stylesheet\" 
type\u003d\"text/css\" href\u003d\"styles.css\"\u0026gt;\n\t--\u0026gt;\n  
\u0026lt;/head\u0026gt;\n  \n  \u0026lt;body\u0026gt;\n    Hello World!!! 
\u0026lt;br\u0026gt;\n  
\u0026lt;/body\u0026gt;\n\u0026lt;/html\u0026gt;\n\u003c/p\u003e\n\u003c/body\u003e\u003c/html\u003e","X-TIKA:parse_time_millis":"57"}]
{code}
So tika-server is detecting basic-http.jsp as text/plain.
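
For anyone who wants to check the detected type without reading the raw JSON by eye, here is a minimal Python sketch. It assumes the /rmeta response is a JSON list whose first element carries a "Content-Type" entry, as in the output above. The function name detected_media_type and the trimmed sample string are hypothetical, made up for illustration only.

```python
import json


def detected_media_type(rmeta_json: str) -> str:
    """Return the bare media type from the first record of an /rmeta response.

    Assumption: the response is a JSON list of metadata objects and the
    first object has a "Content-Type" key, as in the output quoted above.
    """
    records = json.loads(rmeta_json)
    content_type = records[0]["Content-Type"]
    # Be defensive: some metadata values are lists (see "X-Parsed-By").
    if isinstance(content_type, list):
        content_type = content_type[0]
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    return content_type.split(";", 1)[0].strip().lower()


# Trimmed sample modelled on the response above (body content omitted).
sample = (
    r'[{"Content-Encoding":"UTF-8",'
    r'"Content-Type":"text/plain; charset\u003dUTF-8",'
    r'"X-Parsed-By":["org.apache.tika.parser.DefaultParser",'
    r'"org.apache.tika.parser.txt.TXTParser"]}]'
)

print(detected_media_type(sample))
```

Saving the curl output to a file and passing its contents to the function should work the same way, under the same assumption about the response shape.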

> Upgrade to Tika 1.11
> --------------------
>
>                 Key: NUTCH-2158
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2158
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Julien Nioche
>             Fix For: 1.11
>
>         Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
