Hi, Steve Tolkin wrote:
> Can you say more about this.

I hope so. However, I will refer to the new functionality of the relaunched portal at nlp.petamem.com.

> Is the source code available?

Partly. Some of the underlying modules, e.g. for numeral conversion, are available on CPAN; other code is proprietary.

> How do you decide which diacritics to add?
> For example both Mueller and Muller get the umlaut added
> on the "u".

Yes. And for Muller it may sometimes be wrong. It's a plain statistical process in which a wordlist, taken from a corpus of the respective language (German in this case), is compared with the words given for diacritization. The system knows about some equivalents for a given language, e.g. u<=>ü, ue<=>ü, "u<=>ü. Without any consideration of the context, such a substitution can of course be wrong. The system then may or may not find a list of alternatives with diacritics, and offers these to the user to choose from.

> Do you know of code that removes diacritics in a reasonable
> way, e.g. for systems that can only handle ASCII.

Yes. Have a look at the new portal. It does exactly this in the diacritics operations section. In fact there are now three modes of operation: "Choose", "Fit1st" and "Remove".

> Ideally your approach to adding diacritics would be fully reversible,

Yes. But we have a long way to go to achieve this.
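
To make the wordlist lookup described above a bit more concrete, here is a minimal Python sketch. It is not the portal's code (that part is proprietary). The function names, the tiny wordlist and the exact behaviour of the three modes are assumptions for illustration only. In particular, "Fit1st" is assumed to mean "take the first candidate found", and "Remove" is done here with Unicode decomposition from the standard `unicodedata` module.

```python
import unicodedata

# Assumed equivalences for German, in the spirit of u<=>ü, ue<=>ü, "u<=>ü.
EQUIVALENTS = [('"u', "ü"), ("ue", "ü"), ("u", "ü")]


def variants(word):
    """Yield every spelling reachable by applying the equivalences at any position."""
    if not word:
        yield ""
        return
    # Option 1: keep the first character unchanged.
    for rest in variants(word[1:]):
        yield word[0] + rest
    # Option 2: replace an ASCII spelling at the start with its diacritic form.
    for ascii_form, diacritic in EQUIVALENTS:
        if word.startswith(ascii_form):
            for rest in variants(word[len(ascii_form):]):
                yield diacritic + rest


def candidates(word, wordlist):
    """Return the variants of `word` that occur in `wordlist` (no context is used)."""
    return sorted({v for v in variants(word) if v != word and v in wordlist})


def remove_diacritics(text):
    """Strip combining marks after canonical decomposition, e.g. ü -> u."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def diacritize(word, wordlist, mode="Choose"):
    """Hypothetical dispatcher for the three modes named in the mail."""
    if mode == "Remove":
        return remove_diacritics(word)
    found = candidates(word, wordlist)
    if mode == "Fit1st":
        # Assumption: take the first candidate, or leave the word unchanged.
        return found[0] if found else word
    return found  # "Choose": hand all alternatives to the user


if __name__ == "__main__":
    wordlist = {"Müller", "Mueller"}  # stand-in for a corpus-derived wordlist
    print(diacritize("Mueller", wordlist))
    print(diacritize("Muller", wordlist, mode="Fit1st"))
    print(diacritize("Müller", wordlist, mode="Remove"))
```

Note that this kind of removal maps "Müller" to "Muller", not "Mueller", and that both "Mueller" and "Muller" lead back to "Müller", which is exactly why the round trip is not reversible without further information.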
