On Tue, 26 Jan 2010 14:42:34 -0600, Ron Catterall wrote:
> Imagine a linguist wanting to search some text to count
> ...
> The problem of course is not a Docbook problem, it is in the UTF tables

The problem is with neither; it is with the linguist :-). (I can say that, because I'm a linguist.)

All seriousness aside, using corpora for linguistics requires more than looking for certain Unicode characters, which may not be used consistently anyway (and especially in a case like this, where the characters, if they were distinct Unicode characters, would doubtless be confused). Distinguishing between quotes and apostrophes requires some fairly complex methods. There are rules of thumb that often work, but they will break on certain cases. Corpus linguists become familiar with where these things break, and construct workarounds accordingly, or hand-tag recalcitrant cases.

If you really want an interesting problem, go for distinguishing among the uses of the ASCII period!

Mike Maxwell
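
To illustrate the kind of rule of thumb mentioned above, here is a minimal sketch in Python. It is not taken from any corpus toolkit; the function name classify_single_quotes and the three rules are assumptions made up for this example. It labels each ASCII ' by looking only at the characters on either side of it.

```python
def classify_single_quotes(text):
    """Label each ASCII ' in text using a naive neighbor-character heuristic.

    Hypothetical rules of thumb (assumptions for illustration only):
      letter ' letter         -> apostrophe   (don't)
      non-alnum ' alnum       -> open_quote   ('go)
      alnum ' non-alnum       -> close_quote  (go')
      anything else           -> unknown
    Returns a list of (index, label) pairs.
    """
    labels = []
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        # Treat the start and end of the text as whitespace.
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        if prev_ch.isalpha() and next_ch.isalpha():
            label = "apostrophe"
        elif not prev_ch.isalnum() and next_ch.isalnum():
            label = "open_quote"
        elif prev_ch.isalnum() and not next_ch.isalnum():
            label = "close_quote"
        else:
            label = "unknown"
        labels.append((i, label))
    return labels


# The easy case comes out right:
print(classify_single_quotes("He said 'don't go' loudly."))
# The plural possessive is mislabeled as a closing quote:
print(classify_single_quotes("the dogs' bones"))
```

The second call shows where such a rule breaks: the apostrophe in "dogs'" is labeled close_quote, and a word-initial apostrophe as in "'tis" would likewise be labeled open_quote. Cases like these are the ones that end up needing workarounds or hand-tagging.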
