According to Thilo Bauer:
> Thinking a little bit about internationalization.
>
> Locales are nice, and for htdig they seem a good way
> to find out what is a printable character and what
> is not.
>
> But,
> DO I NEED A SPECIAL LOCALE?
>
> Sometimes yes. But preferably not.
>
> Assume:
>
> 1. You are a member of an organization that publishes a
>    great scientific magazine. The publishing language
>    would be English, for better understanding.
>    Your authors come from countries all around the
>    world; many of them are European, some are Chinese
>    or Japanese, etc.
>
> 2. The magazine should be published on the web, and
>    you want to provide htdig as a search engine.
>
> 3. Names will usually contain any printable character
>    that you can find in ISO 8859-1 and others.
>
> Question: DO THE PEOPLE HERE RELY ON LOCALES?
>
> NO! Otherwise (some) names wouldn't be found
> in the search engine, esp. if you want to provide special
> and/or different locales from htdig.
>
> This leads to my conclusion: I preferably don't need
> a locale. What I really need is a full ISO character set
> for general purpose. This should be the default, without
> any assumption about what is found on the local
> operating system.
>
> I only have to distinguish between printable characters
> that *may be contained in a word* (e.g. a name) and
> characters that may not. And I only need a conversion
> table to convert these characters into their corresponding
> lower case values, as htdig does to build
> its word database and index tables.
>
> So maybe it would be better to give htdig
> a default behaviour of using a complete
> character set for one of the most common ISO standards,
> e.g. ISO 8859-1.
>
> What does a locale really do? Right, the same thing: it
> provides a character mapping table, which describes
> the interpretation of character codes.
>
> Other interesting questions:
>
> What if the charset is explicitly defined within the
> document and neither htdig nor your current locale
> matches it? - Good question...
>
> What is the main focus of a search engine?
>
> - Right: index and search web documents.
> - False: index documents based on assumptions about
>   where the engine itself lives.
> - False again: ignore the content and charsets of the
>   documents themselves.
>
> Finally, it could be better to fully support
> UTF-8 instead of 8-bit characters only...
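To make the quoted idea concrete, here is a minimal sketch of a locale-independent word-character table and lower-case conversion table for ISO 8859-1. It is written in Python for brevity and is not htdig code: the names IS_WORD_CHAR, LOWER and split_words are made up for illustration, and the choice to treat only digits and letters as word characters is an assumption.

```python
# Sketch only: fixed ISO 8859-1 tables, built without consulting the system locale.
# IS_WORD_CHAR, LOWER and split_words are hypothetical names, not htdig APIs.

def _build_tables():
    word = [False] * 256
    lower = list(range(256))
    for code in range(256):
        is_ascii_alnum = (0x30 <= code <= 0x39 or 0x41 <= code <= 0x5A
                          or 0x61 <= code <= 0x7A)
        # Latin-1 letters occupy 0xC0-0xFF, except the multiplication
        # sign (0xD7) and the division sign (0xF7).
        is_latin1_letter = 0xC0 <= code <= 0xFF and code not in (0xD7, 0xF7)
        word[code] = is_ascii_alnum or is_latin1_letter
        # Upper-case letters sit exactly 0x20 below their lower-case forms:
        # A-Z, and 0xC0-0xDE without 0xD7.
        if 0x41 <= code <= 0x5A or (0xC0 <= code <= 0xDE and code != 0xD7):
            lower[code] = code + 0x20
    return word, bytes(lower)

IS_WORD_CHAR, LOWER = _build_tables()

def split_words(data):
    """Split ISO 8859-1 bytes into lower-cased words using the fixed tables."""
    words = []
    current = bytearray()
    for byte in data:
        if IS_WORD_CHAR[byte]:
            current.append(LOWER[byte])
        elif current:
            words.append(bytes(current))
            current = bytearray()
    if current:
        words.append(bytes(current))
    return words

if __name__ == "__main__":
    sample = "Zürich ÅNGSTRÖM 3×4".encode("latin-1")
    print([w.decode("latin-1") for w in split_words(sample)])
```

Because the tables are constants, the result is the same on any host, whatever locales are installed. The sketch still assumes every document really is ISO 8859-1, which is exactly the limitation raised in the question about documents that declare their own charset.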
You make some very good points, but I think they've all been covered
in previous discussions on the topic. I suggest you search through
the mailing list archives to catch up on earlier discussions.

You don't need to convince us of the need for all this - we already
know that we need to be able to support an expanded character set like
UTF-8 or Unicode, independently of system locale definitions, to get
htdig to work properly with any language. What we need is someone
who's willing to tackle this non-trivial project.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

