On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:

Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which in Nutch defaults to UTF-8). And you can use the accept-charset form attribute to explicitly specify UTF-8.

Yes, that's right (FWIW, I'm one of the authors of RFC 2070)... it'd be interesting to see how well ICU does with crawl data. Does anyone have any experience?
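
To make the query-string point concrete, here is a minimal sketch (Python standard library only, not Nutch code) of how the same form value looks on the wire under two page encodings, and what happens if the receiver guesses the wrong one. The term "café" and the variable names are just illustrative assumptions.

```python
from urllib.parse import quote, unquote

# Hypothetical query term typed into a search form (not from the thread).
term = "café"

# A browser percent-encodes the form value using the page encoding
# (or the encoding named in the form's accept-charset attribute).
as_utf8 = quote(term, encoding="utf-8")         # page served as UTF-8
as_latin1 = quote(term, encoding="iso-8859-1")  # page served as ISO-8859-1

# The query string itself carries no charset tag, so the server must
# already know which encoding was used in order to decode it correctly.
decoded_right = unquote(as_utf8, encoding="utf-8")
decoded_wrong = unquote(as_utf8, encoding="iso-8859-1")

print(as_utf8, as_latin1)
print(decoded_right, decoded_wrong)
```

If the form always submits UTF-8 (for example via accept-charset), the server can decode with a fixed charset and does not have to guess.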



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
