Eric and Mike wrote:

> > Maybe I should draw search results from MyLibrary and not swish-e to
> > display characters correctly? If I draw content from many global
> > sources, then how do I know what character set to use for display?
>
> This is definitely the best thing to do. Search the normalized data and
> display the original. Also, if you store the documents UTF-8 encoded you
> won't need to worry about the character set, you just need to set the
> encoding for the page to UTF-8 and the browser will take care of the rest.

Bear in mind that even in UTF-8 there is more than one way to represent an accented character. It can be precomposed (a single character, e.g. U+00E9 for lower-case e-acute: this is normalization form C) or decomposed (a base character followed by a non-spacing diacritic, e.g. U+0065 and U+0301, lower-case e plus the combining acute accent: this is normalization form D). The two forms produce different byte sequences, so if you're searching at the byte level, you have to be sure that your index and your search term have been normalized the same way or they won't match. I've found this FAQ useful for this stuff: http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html.

In a Java context, we've used ICU4J (http://icu.sourceforge.net) to normalize text (including stripping accents and normalizing case for different scripts) for indexing and searching in UTF-8. There's also a C API, which could presumably be incorporated into a Perl process, but no doubt there are similar native Perl tools.

In general I think we've got to include i18n from the beginning: pay attention to the character sets of incoming data, normalize as early in the process as possible (especially if ANSEL is involved!), use UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser (this site helps with the HTML: http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still not as easy as it ought to be, but at least there are good open-source tools out there.

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]
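
P.S. A rough sketch of the precomposed/decomposed point above, in case it helps to see it run. This is in Python, with the standard-library unicodedata module standing in for ICU4J. That substitution, the function name normalize_for_index, and the sample word are all assumptions made for illustration and are not taken from this thread. It is only meant to show that the two forms differ at the byte level until both sides are normalized the same way, not to be a complete indexing pipeline.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Hypothetical helper: fold text into a key for indexing and searching.

    Decomposes to form D, drops combining marks (accents), folds case,
    then recomposes to form C so every key is in one consistent form.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0301.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped.casefold())


precomposed = "caf\u00e9"   # e-acute as a single code point (form C)
decomposed = "cafe\u0301"   # e followed by combining acute (form D)

# Same text to a reader, different code points and different UTF-8 bytes.
print(precomposed == decomposed)      # False
print(precomposed.encode("utf-8"))    # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))     # b'cafe\xcc\x81'

# Normalizing both sides to the same form makes them match.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# The accent- and case-folded keys match as well.
print(normalize_for_index(precomposed) == normalize_for_index(decomposed))  # True
```

The same function would have to be applied to the documents at index time and to the query at search time; the original text is kept separately for display.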