Re: [Nutch-dev] Encoding problems

Doug Cutting Sun, 01 Feb 2004 15:13:34 -0800

Andrzej Bialecki wrote:

There is something fishy going on with the pages' encoding. I created an index of non-English pages (which mainly use Latin1 and Latin2), and the national characters are completely broken, both on the search results page and in the cached view. It looks like somewhere Nutch assumes UTF-8 where it's not, or the other way around, or maybe the code in Entities.java is somewhat broken...

Can you please log a bug report for this and add the bad html as a file attachment. I don't have time to look at it right now, but I too have seen this stuff before, and we don't want to lose track of the issue.

The cache code is known to have problems with charsets. To work correctly, I think we need to save the original encoding header, reuse it when serving the cached page, and make sure that the text we add at the top is either in that encoding too or in a separate frame. The cache code probably needs to become a servlet, rather than a JSP page, to get this right.
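As a rough illustration of "save the original encoding header, reuse it", here is a minimal sketch that decodes stored page bytes with the charset named in the saved Content-Type header. This is not Nutch code: the class and method names are hypothetical, and falling back to ISO-8859-1 when no charset is present is an assumption (it matches the HTTP/1.1 default for text types).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class CachedPageDecoder {

    /** Returns the charset named in a Content-Type header value, or the fallback. */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the raw cached bytes using the saved header (assumed default: ISO-8859-1). */
    static String decode(byte[] raw, String savedContentType) {
        return new String(raw, charsetFromContentType(savedContentType, StandardCharsets.ISO_8859_1));
    }
}
```

The same charset would then have to be sent back in the response header for the cached view, which is presumably easier to control from a servlet than from a JSP page.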

But the search results page should get charsets right, as long as the HTML parser (NekoHTML), Entities.java, and the rest of the chain each do the right thing.

Oh, BTW: sometimes displaying the cached content takes ages. When I select "View source" in the browser, the full source is there, but the browser wants to get other external resources, like javascript files, other frames, etc. I have the impression that cached results of the same pages from Google are displayed quicker... Perhaps they cut off some links on the copy to other originally linked content?

I think they do the same thing Nutch does: use the raw HTML with a <base ...> tag at the top. Tell me if you see something different.
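For readers unfamiliar with the technique, here is a minimal sketch of adding a <base ...> tag to raw HTML so that relative links, scripts, and frames resolve against the original site. It is not the actual Nutch cache code: the class and method names are hypothetical, and placing the tag right after the opening <head> tag (falling back to the very top of the document) is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class BaseTagInserter {

    // Matches <head> or <head ...attributes>, but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Returns the HTML with a base tag pointing at the page's original URL. */
    static String withBase(String html, String pageUrl) {
        // Escape double quotes so the URL cannot break out of the attribute.
        String tag = "<base href=\"" + pageUrl.replace("\"", "%22") + "\">";
        Matcher m = HEAD_OPEN.matcher(html);
        if (m.find()) {
            return html.substring(0, m.end()) + tag + html.substring(m.end());
        }
        // No <head> found: put the tag at the top of the document.
        return tag + html;
    }
}
```

Because the browser still fetches every external resource from the original server, this approach would not by itself make the cached view load any faster.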

Another observation about ranking: many times when the matching word is the main part of URL (in my case "ikea" -> www.ikea.com) it would make sense to boost up the results that are not deeply nested. E.g. I got the results from obscure corners on the website first, and then as the 20th or 40th I got the main page. Well, I understand that probably that obscure page had a high frequency of this term, but as a user I'd expect the main entry to the website to show up first. Any comments?

Have you done any link analysis? That should boost home pages. There's also a parameter, the url boost in QueryTranslator.java, which determines how much url matching (and a shorter url is a better match) should count as opposed to content matching. Some folks at Stanford did experiments that indicated that the anchor boost should probably be a *lot* higher. In general, these parameters need tweaking: their default values are wild guesses. But for a start you might try raising the url boost from 4.0 to 8.0 or something, and raising the anchor boost from 2.0 to 10.0 or higher. Tell us how it goes.
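To make the effect of these boosts concrete, here is a deliberately simplified sketch that treats the final score as a weighted sum of per-field match scores. This is not the code in QueryTranslator.java; the class, the method, and the per-field scores used in the note below are all made up for illustration. Only the boost values (4.0 and 2.0 raised to 8.0 and 10.0) come from the message above.

```java
// Hypothetical model of field boosting, not Nutch's actual scoring.
final class FieldBoostSketch {

    /**
     * Combines per-field match scores into one score.
     * Content is given an implicit boost of 1.0 (an assumption).
     */
    static double score(double urlScore, double anchorScore, double contentScore,
                        double urlBoost, double anchorBoost) {
        return urlBoost * urlScore + anchorBoost * anchorScore + contentScore;
    }
}
```

Take an invented home page with a strong URL and anchor match but little matching text (url 1.0, anchor 0.8, content 0.2) and an invented deep page with heavy term frequency (url 0.3, anchor 0.1, content 5.0). In this model the deep page ranks first with boosts of 4.0 and 2.0, and the home page ranks first with boosts of 8.0 and 10.0.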

Doug

