"Philippe Verdy" <[EMAIL PROTECTED]> writes:
> > Drop the part of the sentence before "then". A protocol could delete "the",
> > "an", etc. right now. In fact, I suspect several library systems do drop
> > "the", etc. right now. Not that this makes it a good idea, but that's a
> > lousy argument.
>
> If such a library does this, only based on the presence of the encoded words,
> without wondering in which language the text is written, that kind of
> processing text will be seriously inefficient or inaccurate when processing
> other languages than English for which you will have built such a library.

Many libraries have large numbers of books in English, French, German, Spanish, Italian, and various non-Latin languages. Blanket stripping of "a", "an", "the", and "la" from the start of a title might very well be a good 90% heuristic for removing non-sorting words from the start of titles. (German is the odd man out, since you can't blanket-remove a starting "die".)

> For plain-text (which is what Unicode deals about), even the "an", "the",
> "is" words (and so on...) are equally important as other parts of the text.

No. It all depends on what you want to do with the text. Besides which, the point is that it doesn't matter whether or not words are encoded as codepoints; these processes can work just the same.
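
To make the heuristic concrete, here is a minimal Python sketch of the kind of blanket stripping I mean. The names (NON_SORTING_WORDS, sort_key) are made up for illustration and the word list is just the four words mentioned above; I am not claiming any real library system is implemented this way.

```python
# Hypothetical blanket list of non-sorting words; "die" is deliberately
# left out because it is also an ordinary English word.
NON_SORTING_WORDS = frozenset({"a", "an", "the", "la"})


def sort_key(title, non_sorting=NON_SORTING_WORDS):
    """Return a case-folded sort key with one leading non-sorting word removed."""
    stripped = title.strip()
    # Split off the first whitespace-delimited word only.
    parts = stripped.split(None, 1)
    if len(parts) == 2 and parts[0].casefold() in non_sorting:
        return parts[1].casefold()
    # Single-word titles and titles without a leading article are kept whole.
    return stripped.casefold()


titles = ["The Hobbit", "La Peste", "Die Hard", "An Ideal Husband"]
ordered = sorted(titles, key=sort_key)
```

Note that the code never asks what language a title is in and only looks at the first word, which is exactly why it is a heuristic: a title that legitimately begins with "A" as something other than an article gets its first word stripped too.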

