[
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018544#comment-15018544
]
Lewis John McGibbney commented on NUTCH-2158:
---------------------------------------------
Hi [~jnioche], I can reproduce your failing test as above.
When I try to detect the content type using tika-server trunk (1.12-SNAPSHOT), I
get the following:
{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new/src/plugin/protocol-http/jsp(joshua) $
curl -T basic-http.jsp http://localhost:9998/rmeta
[{"Content-Encoding":"UTF-8","Content-Type":"text/plain;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.txt.TXTParser"],"X-TIKA:content":"\u003chtml
xmlns\u003d\"http://www.w3.org/1999/xhtml\"\u003e\n\u003chead\u003e\n\u003cmeta
name\u003d\"X-Parsed-By\" content\u003d\"org.apache.tika.parser.DefaultParser\"
/\u003e\n\u003cmeta name\u003d\"X-Parsed-By\"
content\u003d\"org.apache.tika.parser.txt.TXTParser\" /\u003e\n\u003cmeta
name\u003d\"Content-Encoding\" content\u003d\"UTF-8\" /\u003e\n\u003cmeta
name\u003d\"Content-Type\" content\u003d\"text/plain; charset\u003dUTF-8\"
/\u003e\n\u003ctitle\u003e\u003c/title\u003e\n\u003c/head\u003e\n\u003cbody\u003e\u003cp\u003e\u0026lt;%--\n
Licensed to the Apache Software Foundation (ASF) under one or more\n
contributor license agreements. See the NOTICE file distributed with\n this
work for additional information regarding copyright ownership.\n The ASF
licenses this file to You under the Apache License, Version 2.0\n (the
\"License\"); you may not use this file except in compliance with\n the
License. You may obtain a copy of the License at\n \n
http://www.apache.org/licenses/LICENSE-2.0\n \n Unless required by applicable
law or agreed to in writing, software\n distributed under the License is
distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.\n See the License for the specific language
governing permissions and\n limitations under the
License.\n--%\u0026gt;\u0026lt;%--\n Example JSP Page to Test Protocol-Http
Plugin \n--%\u0026gt;\u0026lt;%@ page language\u003d\"java\"
import\u003d\"java.util.*\"
pageEncoding\u003d\"UTF-8\"%\u0026gt;\u0026lt;%\nString path \u003d
request.getContextPath();\nString basePath \u003d
request.getScheme()+\"://\"+request.getServerName()+\":\"+request.getServerPort()+path+\"/\";\n%\u0026gt;\n\n\u0026lt;!DOCTYPE
HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\u0026gt;\n\u0026lt;html
xmlns\u003d\"http://www.w3.org/1999/xhtml\"\u0026gt;\n
\u0026lt;head\u0026gt;\n \u0026lt;base
href\u003d\"\u0026lt;%\u003dbasePath%\u0026gt;\"\u0026gt;\n \n
\u0026lt;title\u0026gt;HelloWorld\u0026lt;/title\u0026gt;\n \u0026lt;meta
http-equiv\u003d\"content-type\" content\u003d\"text/html;charset\u003dutf-8\"
/\u0026gt;\n \u0026lt;meta name\u003d\"Language\" content\u003d\"en\"
/\u0026gt;\n\t\u0026lt;meta http-equiv\u003d\"pragma\"
content\u003d\"no-cache\"\u0026gt;\n\t\u0026lt;meta
http-equiv\u003d\"cache-control\"
content\u003d\"no-cache\"\u0026gt;\n\t\u0026lt;meta http-equiv\u003d\"expires\"
content\u003d\"0\"\u0026gt; \n\t\u0026lt;meta http-equiv\u003d\"keywords\"
content\u003d\"keyword1,keyword2,keyword3\"\u0026gt;\n\t\u0026lt;meta
http-equiv\u003d\"description\" content\u003d\"This is my
page\"\u0026gt;\n\t\u0026lt;!--\n\t\u0026lt;link rel\u003d\"stylesheet\"
type\u003d\"text/css\" href\u003d\"styles.css\"\u0026gt;\n\t--\u0026gt;\n
\u0026lt;/head\u0026gt;\n \n \u0026lt;body\u0026gt;\n Hello World!!!
\u0026lt;br\u0026gt;\n
\u0026lt;/body\u0026gt;\n\u0026lt;/html\u0026gt;\n\u003c/p\u003e\n\u003c/body\u003e\u003c/html\u003e","X-TIKA:parse_time_millis":"57"}]
{code}
So tika-server is detecting the JSP file as text/plain.
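For anyone who wants to script this check rather than read the raw JSON, here is a minimal sketch that pulls the detected media type out of an /rmeta response body. It is not part of Nutch or Tika: the function name `detected_media_type` and the shortened `sample` string are assumptions made for illustration, and it only relies on the response being a JSON array whose first object carries a "Content-Type" key, as in the output above.

```python
import json


def detected_media_type(rmeta_json: str) -> str:
    """Return the bare media type from a tika-server /rmeta response body.

    Hypothetical helper for illustration; assumes the first metadata
    object in the array has a "Content-Type" entry.
    """
    records = json.loads(rmeta_json)  # /rmeta output above is a JSON array
    content_type = records[0]["Content-Type"]
    # Be defensive in case the value is serialized as a list of strings.
    if isinstance(content_type, list):
        content_type = content_type[0]
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    return content_type.split(";", 1)[0].strip().lower()


# Shortened version of the response shown above (content field omitted).
sample = (
    '[{"Content-Encoding":"UTF-8",'
    '"Content-Type":"text/plain; charset\\u003dUTF-8",'
    '"X-Parsed-By":["org.apache.tika.parser.DefaultParser",'
    '"org.apache.tika.parser.txt.TXTParser"]}]'
)

print(detected_media_type(sample))  # prints: text/plain
```

Feeding it the saved output of the curl command above should make it easy to compare what tika-server reports across Tika versions.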
> Upgrade to Tika 1.11
> --------------------
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
> Issue Type: Task
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.