[
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023123#comment-15023123
]
Sebastian Nagel commented on NUTCH-2158:
----------------------------------------
We need to pass the rendered HTML, as returned by the server (Jetty) for the
JSP page, to Tika. I did this by adding a sleep to the unit test so that the
document could be fetched:
{noformat}
% wget -O basic-http.jsp.html -d http://127.0.0.1:47504/basic-http.jsp
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
...
Server: Jetty(6.1.26)
...
% cat basic-http.jsp.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<base href="http://127.0.0.1:47504/">
<title>HelloWorld</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<meta name="Language" content="en" />
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta http-equiv="keywords" content="keyword1,keyword2,keyword3">
<meta http-equiv="description" content="This is my page">
<!--
<link rel="stylesheet" type="text/css" href="styles.css">
-->
</head>
<body>
Hello World!!! <br>
</body>
</html>
% java -jar tika-app-1.10.jar -d basic-http.jsp.html
application/xhtml+xml
% java -jar tika-app-1.11.jar -d basic-http.jsp.html
text/html
{noformat}
It's definitely a change in Tika, probably caused by TIKA-1771, which lowers the
probability of {{application/xhtml+xml}} being detected.
But we can probably live with this changed behavior. It's more an improvement
than a bug:
- both the HTTP header and the metadata claim {{text/html}}
- the document itself isn't clean XHTML
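To illustrate the second point, here is a minimal sketch that feeds an abbreviated copy of the page to the JDK's XML parser. It is not Tika's detection logic, only a rough well-formedness check; the class name {{XhtmlWellFormedCheck}}, the method {{isWellFormedXml}} and the shortened markup strings are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Nutch or Tika.
class XhtmlWellFormedCheck {

  /** Returns true if the markup parses as well-formed XML. */
  static boolean isWellFormedXml(String markup) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      // Replace the default handler so parse errors are not printed to stderr.
      builder.setErrorHandler(new ErrorHandler() {
        @Override public void warning(SAXParseException e) { }
        @Override public void error(SAXParseException e) throws SAXException { throw e; }
        @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
      });
      builder.parse(new InputSource(new StringReader(markup)));
      return true;
    } catch (SAXException e) {
      // Not well-formed, e.g. an element that is never closed.
      return false;
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Abbreviated from the fetched page: <base> and <br> are never closed.
    String asServed = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<base href=\"http://127.0.0.1:47504/\"><title>HelloWorld</title>"
        + "</head><body>Hello World!!! <br></body></html>";
    // Same content with the empty elements closed the XML way.
    String cleaned = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<base href=\"http://127.0.0.1:47504/\"/><title>HelloWorld</title>"
        + "</head><body>Hello World!!! <br/></body></html>";
    System.out.println("as served: " + isWellFormedXml(asServed));
    System.out.println("cleaned:   " + isWellFormedXml(cleaned));
  }
}
```

The page declares the XHTML namespace but leaves tags such as {{<base>}} and {{<br>}} unclosed, so a strict XML parser rejects it.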
> Upgrade to Tika 1.11
> --------------------
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
> Issue Type: Task
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.