Dear Roman,
        I am interested in the code that removes diacritics.

Although Richard said that this is "trivial", I am not so sure.
In theory this should be driven from the character data
in the Unicode database.
It should remove diacritics from Western languages,
but I do not want to remove all combining characters: e.g., they
should be preserved for the alphabetic languages of India.
I do not know the best approach for Cyrillic-based languages.
So a sketch of an algorithm would be: convert to NFD
(Unicode Normalization Form D, i.e. canonical decomposition)
and then remove only those combining characters that follow
(were combined with) characters on certain code pages.
One key question is -- which code pages?
And are there some code pages where some combining
characters should be removed, but not others?
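A rough sketch of that NFD idea in Python follows. The "is the base
character Latin?" test is just my stand-in for whatever per-code-page
policy gets decided; a real answer to the "which code pages?" question
would replace it:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks, but only those attached to Latin base chars."""
    out = []
    strip_marks = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if strip_marks:
                continue  # drop this mark: its base character was Latin
            out.append(ch)
        else:
            # Decide, per base character, whether the marks that follow
            # it should be stripped. Cyrillic and Indic bases keep theirs.
            strip_marks = "LATIN" in unicodedata.name(ch, "")
            out.append(ch)
    # Recompose so untouched sequences come back out as they went in.
    return unicodedata.normalize("NFC", "".join(out))

print(strip_diacritics("Müller naïve café"))  # -> "Muller naive cafe"
print(strip_diacritics("हिन्दी"))  # Devanagari combining signs preserved
```

Note this still punts on the hard part of my own question: a policy
where some marks on the same script are stripped but others are not
would need a finer test than the base character's script alone.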

But doing all this still might be "straightforward"
and so I am curious as to how you do it.

Thanks,
Steve
-- 
Steven Tolkin          [EMAIL PROTECTED]      617-563-0516 
Fidelity Investments   82 Devonshire St. V4D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.


> -----Original Message-----
> From: Richard Jelinek [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 02, 2003 10:29 AM
> To: Tolkin, Steve
> Cc: [EMAIL PROTECTED]
> Subject: Re: NLP portal nlp.petamem.com
> 
> 
> On Wed, Apr 02, 2003 at 09:59:45AM -0500, Tolkin, Steve wrote:
> > Can you say more about this.
> 
> "Immer gerne" ("always gladly") :-)
> 
> > Is the source code available?
> 
> No. Especially not the NLP algorithms. But the page is based on Yawps
> (http://yawps.sourceforge.net/), which can be downloaded - though that
> is only the framework for the site.
> 
> We have extended it and will contribute back to Yawps (if that hasn't
> happened already - should check that with the developers).
> 
> > How do you decide which diacritics to add?
> > For example both Mueller and Muller get the umlaut added
> > on the "u".
> 
> Yes. :-) And as there is no "Müeller", the results are correct - right?
> Basically we use some kind of expansion-reduction algorithm where we
> generate n hypotheses of a given "diacritics-less" word and then
> compare it with the statistical data we got from the analysis of large
> (and I mean large) corpora. Either irrelevant hypotheses are pruned
> out or the user gets a choice offer.
> 
> We plan to use the feedback from the choices made by users to improve
> our statistical data. Alas, large corpora don't always mean good
> corpora, so there is some polluted data.
> 
> > Do you know of code that removes diacritics in a reasonable
> > way, e.g. for systems that can only handle ASCII.
> 
> This is trivial - we didn't dare to put this on the web. Ask Roman (rv
> instead of rj at my email address) - he will provide you with
> the snippet.
> 
> > Ideally your approach to adding diacritics would be fully reversible,
> > when processing the unaccented words,
> > but that is perhaps too idealistic.
> 
> Well - for Czech it is fully reversible, but for German there may be
> groups of chars (ss -> ß, ue -> ü etc.) that are folded into one char
> only, so the path isn't reversible anymore: Mueller -> Müller -> Muller.
> 
> 
> -- 
> best regards,
> 
>      Dipl.-Inf. Richard Jelinek
> 
>      - PetaMem s.r.o. - Ocelarska 1 - Prague - www.petamem.com -
>                      -= 2026049 Mind Units =-
> 
> 
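P.S. If I understand the expansion-reduction idea correctly, a toy
version might look like the sketch below. The expansion table and the
frequency counts are made-up placeholders for the large-corpus
statistics Richard describes (and a real version would also need
multi-character expansions like ue -> ü to handle Mueller -> Müller):

```python
from itertools import product

# Hypothetical expansion table: each bare letter and its accented variants.
EXPANSIONS = {"u": ["u", "ü"], "a": ["a", "ä"], "o": ["o", "ö"], "s": ["s", "ß"]}

# Hypothetical word frequencies standing in for real corpus statistics.
FREQ = {"müller": 9500, "muller": 12, "schön": 8000, "schon": 40000}

def restore_diacritics(word: str, keep: int = 3):
    """Generate all accent hypotheses for a word, then rank them by
    corpus frequency; hypotheses unseen in the corpus are pruned."""
    choices = [EXPANSIONS.get(ch, [ch]) for ch in word.lower()]
    hypotheses = {"".join(combo) for combo in product(*choices)}
    scored = sorted(((FREQ.get(h, 0), h) for h in hypotheses), reverse=True)
    # Survivors are either a single best guess or a choice offer to the user.
    return [h for score, h in scored[:keep] if score > 0]

print(restore_diacritics("muller"))  # -> ['müller', 'muller']
print(restore_diacritics("schon"))   # -> ['schon', 'schön']
```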