The problem is the charset. If I change the page's charset from UTF-16
to UTF-8, Nutch fetches the URLs linked from that page. If the page
declares UTF-16, Nutch fetches just the first URL and none of the URLs
linked from that page, unlike with UTF-8. Here are two examples:
UTF-16 - no additional URLs are picked up for the next cycle:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
</head>
<body>
<a href="slo.php">sdfsdf</a>
<a class="ai" href="info.aspx?docid=54046">test</a>
<a href="http://www.something.com">something.com</a>
</body>
</html>
UTF-8 - the first two URLs are fetched in the next cycle, and this is OK:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<a href="slo.php">sdfsdf</a>
<a class="ai" href="info.aspx?docid=54046">test</a>
<a href="http://www.something.com">something.com</a>
</body>
</html>
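One possible explanation, which I have not confirmed: the test file may actually be saved as single-byte text (ASCII/UTF-8) while the meta tag claims UTF-16. The Python sketch below only illustrates that mismatch. It is not Nutch's parser; extract_links and the regex are hypothetical stand-ins for a real link extractor, and the assumption about how the file was saved is mine.

```python
import re

# The UTF-16 test page from above, as one string.
PAGE = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />'
    '</head><body>'
    '<a href="slo.php">sdfsdf</a>'
    '<a class="ai" href="info.aspx?docid=54046">test</a>'
    '<a href="http://www.something.com">something.com</a>'
    '</body></html>'
)

# Hypothetical stand-in for a link extractor: grabs href values of <a> tags.
HREF = re.compile(r'<a\b[^>]*\bhref="([^"]*)"', re.IGNORECASE)


def extract_links(raw, charset):
    """Decode raw bytes with the given charset and return the hrefs found."""
    text = raw.decode(charset, errors="replace")
    return HREF.findall(text)


# Assumption: the file on disk is single-byte, whatever the meta tag says.
single_byte = PAGE.encode("ascii")
# For comparison: the same page genuinely encoded as UTF-16 (with BOM).
real_utf16 = PAGE.encode("utf-16")

# Single-byte bytes decoded as UTF-16: pairs of bytes merge into unrelated
# characters, so no "<a href" survives and no links are found.
links_mismatch = extract_links(single_byte, "utf-16")

# Bytes and declared charset agree: all three links are found.
links_utf16 = extract_links(real_utf16, "utf-16")
links_utf8 = extract_links(single_byte, "utf-8")

print(links_mismatch)  # []
print(links_utf16)
print(links_utf8)
```

If that is what happens here, checking the actual bytes of the file (for example, whether it starts with a UTF-16 byte order mark) would show whether the declared charset matches the real encoding.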
Thanks!
Best regards,
Vasja
Doğacan Güney wrote:
On 9/11/07, Vasja Ocvirk <[EMAIL PROTECTED]> wrote:
Does anyone know what to do if Nutch doesn't crawl and index web pages
in UTF-16? Has anyone had such a problem yet?
Nutch should work with UTF-16. Can you describe your problem in more detail?
Best regards,
Vasja