The problem is the charset. If we change the page's charset from UTF-16 to UTF-8, Nutch fetches all the URLs on that page. If the page declares UTF-16, Nutch fetches only that first URL and none of the URLs linked from the page, unlike with UTF-8. Here are two examples:

UTF-16: no additional URLs for the next cycle
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
</head>
<body>
  <a href="slo.php">sdfsdf</a>
  <a class="ai" href="info.aspx?docid=54046">test</a>
  <a href="http://www.something.com">something.com</a>
</body>
</html>

UTF-8: the first two URLs are fetched for the next cycle, and this is OK.
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
  <a href="slo.php">sdfsdf</a>
  <a class="ai" href="info.aspx?docid=54046">test</a>
  <a href="http://www.something.com">something.com</a>
</body>
</html>
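
One possible explanation, which is only an assumption and is not confirmed anywhere in this thread: the bytes served are actually ASCII/UTF-8, and only the meta tag claims UTF-16. A parser that trusts the declared charset would then decode the page into unrelated characters and find no href attributes at all. The sketch below illustrates that effect in Python. It is not Nutch code; PAGE, HREF and extract_links are hypothetical names made up for this illustration.

```python
import re

# Hypothetical page body: plain ASCII bytes on the wire, although the
# meta tag declares UTF-16 (assumption made for this illustration).
PAGE = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />'
    b'</head><body>'
    b'<a href="slo.php">sdfsdf</a>'
    b'<a class="ai" href="info.aspx?docid=54046">test</a>'
    b'<a href="http://www.something.com">something.com</a>'
    b'</body></html>'
)

# Very small href matcher, good enough for this sample only.
HREF = re.compile(r'href="([^"]*)"')


def extract_links(raw, charset):
    """Decode raw bytes with the given charset and return all href values."""
    text = raw.decode(charset, errors="replace")
    return HREF.findall(text)


if __name__ == "__main__":
    # Decoding with the charset that matches the real bytes finds the links.
    print("utf-8 :", extract_links(PAGE, "utf-8"))
    # Decoding the same bytes as UTF-16 pairs up the ASCII bytes into
    # unrelated code units, so no href attribute survives.
    print("utf-16:", extract_links(PAGE, "utf-16"))
```

If the two outputs differ like this for the real page, comparing the bytes actually sent by the server with the declared charset would be a reasonable next check.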

Thanks!

Best regards,
Vasja

Doğacan Güney wrote:
On 9/11/07, Vasja Ocvirk <[EMAIL PROTECTED]> wrote:
Does anyone know what to do if Nutch doesn't crawl and index web pages
in UTF-16? Has anyone had such a problem before?

Nutch should work with UTF-16. Can you describe your problem in more detail?

Best regards,
Vasja


