Hi,

Steve Tolkin wrote:

> Can you say more about this.

I hope so. However, my answers refer to the new functionality
of the relaunched portal at nlp.petamem.com.

> Is the source code available?

Partly. Some of the underlying modules, e.g. for numeral
conversion, are available on CPAN; other code is proprietary.

> How do you decide which diacritics to add?
> For example both Mueller and Muller get the umlaut added
> on the "u".

Yes. And for Muller it may sometimes be wrong. It's
a plain statistical process in which a wordlist, taken
from a corpus of the respective language (German in this
case), is compared with the words given for diacritization.

Now the system knows about some equivalents for a given
language, so u<=>ü, ue<=>ü, "u<=>ü etc. Without any
consideration of the context, this can of course be wrong.

The system then may or may not find a list of alternatives
with diacritics, and offers these to the user to choose from.
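
To make the idea concrete, here is a minimal sketch in Python.
It is not the portal's code: the equivalence table, the word
counts and the function name are made up for illustration, and
ranking by corpus frequency is my assumption of what
"statistical" could mean here.

```python
# Hypothetical equivalents for German (plain form -> diacritic forms).
# Only lowercase "u" forms are covered, to keep the sketch short.
EQUIVALENTS = {"u": ["ü"], "ue": ["ü"], '"u': ["ü"]}

# Hypothetical word frequencies, standing in for a real corpus wordlist.
WORD_COUNTS = {"Müller": 120, "Muller": 3, "Mutter": 80}


def diacritic_candidates(word, equivalents=EQUIVALENTS, counts=WORD_COUNTS):
    """Return all spellings of `word` found in the wordlist,
    most frequent first. The list may be empty."""
    found = set()

    def expand(pos, prefix):
        if pos == len(word):
            if prefix in counts:
                found.add(prefix)
            return
        # Option 1: keep the current character unchanged.
        expand(pos + 1, prefix + word[pos])
        # Option 2: replace a matching plain sequence by its equivalent.
        for plain, marked_forms in equivalents.items():
            if word.startswith(plain, pos):
                for marked in marked_forms:
                    expand(pos + len(plain), prefix + marked)

    expand(0, "")
    # Rank by descending frequency, then alphabetically for stable output.
    return sorted(found, key=lambda w: (-counts[w], w))


print(diacritic_candidates("Mueller"))
print(diacritic_candidates("Muller"))
```

With this toy wordlist, both "Mueller" and "Muller" lead to
"Müller", and "Muller" also keeps its plain spelling as a
lower-ranked alternative. No context is considered, so the
top-ranked candidate can still be the wrong one.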

> Do you know of code that removes diacritics in a reasonable
> way, e.g. for systems that can only handle ASCII.

Yes. Have a look at the new portal. It does exactly this
in the diacritics operations section. In fact, there are
now three modes of operation: "Choose", "Fit1st" and "Remove".
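
If you just need something local for ASCII-only systems, a
rough sketch of removal in Python could look like this. It is
not how the portal's "Remove" mode is implemented; it uses
Unicode decomposition from the standard library, and the
optional German expansion table is my own illustration.

```python
import unicodedata

# Hypothetical per-language table for the conventional German
# transliteration (ü -> ue etc.); other languages need their own.
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def remove_diacritics(text, expansions=None):
    """Strip combining marks from `text`. If `expansions` is given,
    apply those replacements first (e.g. ü -> ue)."""
    if expansions:
        text = "".join(expansions.get(ch, ch) for ch in text)
    # NFD splits a letter like "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("Müller"))
print(remove_diacritics("Müller", GERMAN_EXPANSIONS))
```

Note that characters without a decomposition, such as "ß",
pass through unchanged unless a table like the one above
handles them, so the result is not guaranteed to be ASCII.
Removal is also lossy: "Müller" and "Muller" both end up as
"Muller".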

> Ideally your approach to adding diacritics would be fully reversible,

Yes. But we have a long way to go to achieve this.


