Mark Dilger <[email protected]> writes:
> I am a bit surprised to see that you are right about this, because non-Latin
> languages often have transliteration/romanization schemes for writing the
> language in the Latin alphabet, developed before computers had widespread
> adoption of non-ASCII character sets, and still in use today for text
> messaging. I expected to find stemming rules for transliterated words, but
> can't find any indication of that, neither in the postgres sources, nor in
> the snowball sources I pulled from their repo. Is there some architectural
> separation of stemming from transliteration such that we'd never need to
> worry about it? If snowball ever published stemmers for transliterated text,
> we might have to revisit this issue, but for now your proposed change sounds
> fine to me.

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different. But they don't, AFAICS. Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not. AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if, say, you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.
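To make the routing concern concrete, here is a rough Python sketch. It is not PostgreSQL code: token_class(), route(), and the ROUTING table are hypothetical names, and the mapping is an assumed Russian/English setup. Only the "asciiword" and "word" labels are borrowed from the default parser's token types.

```python
# Hypothetical sketch, not PostgreSQL source: split words into an all-ASCII
# class and an "everything else" class, then pick a dictionary per class.

def token_class(word: str) -> str:
    """Return 'asciiword' if every character is ASCII, else 'word'."""
    return "asciiword" if all(ord(ch) < 128 for ch in word) else "word"

# Assumed mapping for a Russian/English configuration.
ROUTING = {"asciiword": "english_stem", "word": "russian_stem"}

def route(word: str) -> str:
    """Pick a dictionary name based only on the ASCII/non-ASCII split."""
    return ROUTING[token_class(word)]

if __name__ == "__main__":
    for w in ["database", "база", "bonjour", "café"]:
        print(w, "->", route(w))
```

Under these assumptions, "bonjour" lands in the ASCII class and "café" in the non-ASCII class alongside the Russian words, so an ASCII-based split cannot send all French words to one dictionary and all Russian words to another.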
regards, tom lane