Re: Unicode for words?

D. Starner Sun, 05 Dec 2004 16:37:00 -0800

"Philippe Verdy" <[EMAIL PROTECTED]> writes:

> > Drop the part of the sentence before "then". A protocol could delete "the",
> > "an", etc. right now. In fact, I suspect several library systems do drop
> > "the", etc. right now. Not that this makes it a good idea, but that's a
> > lousy argument.
> 
> If such a library does this, only based on the presence of the encoded
> words, without wondering in which language the text is written, that kind
> of processing text will be seriously inefficient or inaccurate when
> processing other languages than English for which you will have built such
> a library.


Many libraries have large numbers of books in English, French, German, Spanish,
Italian, and various non-Latin languages. Blanket stripping of a, an, the, and
la from the start of a title might very well be a good 90% heuristic for
removing non-sorting words from the start of titles. (German is the odd man
out, since you can't blanket-remove a starting die.)
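
To make the heuristic concrete, here is a minimal sketch in Python. The
function name sort_key and the sample titles are made up for illustration;
the article list is just the four words named above, and nothing here is
taken from any real library system.

```python
# Hypothetical, language-blind article stripping for title sorting.
# The list is deliberately only the four words mentioned in the message.
LEADING_ARTICLES = ("a", "an", "the", "la")

def sort_key(title: str) -> str:
    """Return a sort key with at most one leading article removed."""
    words = title.split()
    # Leave one-word titles alone so a title that is just "The" keeps a key.
    if len(words) > 1 and words[0].casefold() in LEADING_ARTICLES:
        words = words[1:]
    return " ".join(words).casefold()

# Sample titles chosen for illustration only.
titles = [
    "The Hobbit",
    "La Peste",
    "An American Tragedy",
    "Die Blechtrommel",
    "A Tale of Two Cities",
]
sorted_titles = sorted(titles, key=sort_key)
```

Note that "Die Blechtrommel" stays under D: "die" is left out of the list on
purpose, because a blanket rule cannot tell the German article from the
English verb.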

> For plain-text (which is what Unicode deals about), even the "an", "the",
> "is" words (and so on...) are equally important as other parts of the text.

No. It all depends on what you want to do with the text.

Besides which, the point is that it doesn't matter whether or not words are
encoded as codepoints; these processes can work just the same.
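
As a rough sketch of that last point, the same filter can run over words
spelled out as strings or over one code per word. The integer codes below
are invented for the example and do not correspond to any real or proposed
encoding; drop_tokens is likewise a hypothetical name.

```python
from typing import Hashable, Iterable, List, Set, TypeVar

T = TypeVar("T", bound=Hashable)

def drop_tokens(tokens: Iterable[T], stop: Set[T]) -> List[T]:
    """Remove every token found in the stop set, whatever the token type."""
    return [t for t in tokens if t not in stop]

# Words represented as ordinary strings.
as_strings = ["the", "cat", "sat", "on", "the", "mat"]
filtered_strings = drop_tokens(as_strings, {"the"})

# The same words represented as made-up per-word codes (assumption:
# one integer per word, values chosen arbitrarily for illustration).
codes = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
as_codes = [codes[w] for w in as_strings]
filtered_codes = drop_tokens(as_codes, {codes["the"]})
```

Either way the filter is the same few lines; only the representation of a
"word" changes.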
