On Mon, Apr 12, 2010 at 1:31 AM, Francis Tyers <[email protected]> wrote:
> On Mon, 12 Apr 2010 at 01:30 +0530, Dhiraj Lohiya wrote:
> >
> >
> > Diacritic restoration in context of Indian scripts:
> > I would be taking some trivial examples to put forward the approach.
> >
> >
> > Scenario 1: [Input is in Unicode, but diacritically incorrect]
> >
> >
> > Indian scripts written in unicode as well suffer from incorrect
> > diacritics.
> >
> >
> > For e.g.:
> > [1] -> Text with incorrect diacritics:
> > हींदी भारत कि राष्ट्रिय भाषा हैं.
>
> >
> >
> > [2] -> Text with corrected diacritics:
> > हिंदी भारत की राष्ट्रीय भाषा है.
> >
>
One fact needs to be considered for diacritic restoration in the above
scenario. Taking ह in हिन्दी as an example, it should only ever be restored to
one of हि, ही, हीं or हिं, and never to हुं, हूँ, हूं etc.
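As a rough illustration of this constraint, here is a minimal Python sketch of a candidate generator. The vowel classes and the name `candidates` are my own assumptions for illustration, not part of any existing tool; a real system would need a fuller table.

```python
# Hypothetical vowel classes: signs inside one class are treated as
# confusable with each other, but never with signs from another class.
VOWEL_CLASSES = [
    ("\u093F", "\u0940"),  # short i, long ii
    ("\u0941", "\u0942"),  # short u, long uu
]
ANUSVARA = "\u0902"


def candidates(consonant: str, vowel_sign: str) -> list[str]:
    """Return the restorations allowed for consonant + vowel_sign."""
    for vowel_class in VOWEL_CLASSES:
        if vowel_sign in vowel_class:
            # Every vowel of the same class, with and without anusvara.
            return [
                consonant + vowel + nasal
                for vowel in vowel_class
                for nasal in ("", ANUSVARA)
            ]
    # Unknown sign: leave the syllable as it is.
    return [consonant + vowel_sign]


print(candidates("\u0939", "\u093F"))
```

For ह with the short i sign this yields only the four i-class forms, so the u-class forms are never proposed.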
So if we follow an approach similar to the base case for Latin scripts, where
the diacritics are stripped completely, the best baseline we could arrive at
here would be:
हिदि भारत कि राष्ट्रिय भाषा हे.
and not
हद भरत क रष्टय भश ह .
since the latter would remove the vowels themselves.
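A minimal Python sketch of such a baseline normalisation follows. The mapping table is an assumption inferred from the example sentence (long vowel signs collapsed onto their short counterparts, ai/au onto e/o, nasalisation marks dropped); `to_baseline` is a hypothetical name.

```python
# Hypothetical baseline table: collapse the distinctions that tend to be
# mistyped, but keep a vowel sign wherever the input had one.
BASELINE_MAP = {
    0x0940: "\u093F",  # long ii sign -> short i sign
    0x0942: "\u0941",  # long uu sign -> short u sign
    0x0948: "\u0947",  # ai sign -> e sign
    0x094C: "\u094B",  # au sign -> o sign
    0x0902: None,      # anusvara removed
    0x0901: None,      # chandrabindu removed
}


def to_baseline(text: str) -> str:
    """Reduce Devanagari text to the vowel-preserving baseline form."""
    return text.translate(BASELINE_MAP)


print(to_baseline("हिंदी भारत की राष्ट्रीय भाषा है."))
```

The aa sign and the virama are left untouched, which is why भारत and भाषा come out unchanged.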
A few more factors will also need to be considered for some
vowels/consonants, case by case, depending on how they are pronounced.
The vowel-preserving form above would then serve as the baseline. Since the
script is different, [1] might be helpful for a more detailed description.
> >
> > Scenario 2: [Input is in ascii]
> >
> >
> > Indian scripts written in ascii by users(transliterated) as well
> > suffer from incorrect diacritics.
> >
> >
> > For e.g.:
> > [3] -> Text with incorrect diacritics:
> > Hindi bharat ki rashtriya bhasha hai.
> >
> >
> > [4] -> Text with corrected diacritics:
> >
> > Hindī bhārata kī rāṣṭrīya bhāṣā hai.
> >
> >
> > [4] might not be feasible since practically no one writes Hindi
> > transliterated in English with proper diacritic and hence probably we
> > won't get enough corpora for training.
>
> Couldn't you transliterate from devanagari -> proper diacritics latin
> and then strip the diacritics to get your training corpus ?
>
This approach would work, but as Francis had pointed out, simply stripping
diacritics would not account for mappings like ṣ -> sh.
This can be solved by noting that there are only a few standard cases that
need to be handled specifically, like:
ṣ -> sh | ष
ś -> sh | श
c -> ch | च
So we could consider the vowel-consonant pairs and work this out using a
pre-defined set of mappings.
In both scenarios, all these mappings can easily be looked up from the
equivalent Latin-script transliterations of the individual vowels and
consonants.
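To make the idea concrete, here is a minimal Python sketch that strips diacritics from transliterated text while applying the special-case mappings first. The table holds only the three cases listed above, lowercase only; `strip_diacritics` and `training_pair` are hypothetical names, and the generic fallback (Unicode decomposition, then dropping combining marks) is my assumption about how the remaining letters would be handled.

```python
import unicodedata

# Special cases that plain diacritic stripping would get wrong.
SPECIAL_CASES = {
    "\u1E63": "sh",  # s with dot below
    "\u015B": "sh",  # s with acute
    "c": "ch",
}


def strip_diacritics(text: str) -> str:
    """Map diacritic transliteration to the plain form users tend to type."""
    out = []
    for char in unicodedata.normalize("NFC", text):
        if char in SPECIAL_CASES:
            out.append(SPECIAL_CASES[char])
            continue
        # Generic case: decompose and drop the combining marks.
        decomposed = unicodedata.normalize("NFD", char)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def training_pair(text: str) -> tuple[str, str]:
    """Return (stripped input, diacritic target) for a training corpus."""
    return strip_diacritics(text), text


print(training_pair("Hindī bhārata kī rāṣṭrīya bhāṣā hai."))
```

Running this over text transliterated from Devanagari would give the kind of stripped/restored pairs Francis suggested. It does not handle things like the dropped final vowel in "bharat", which would need separate rules.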
Inputs welcome!
[1]: http://en.wikipedia.org/wiki/Devanagari
[2]: http://condor.depaul.edu/~dgitomer/General%20course%20resources/diacritics.html
[3]: http://www.omniglot.com/writing/hindi.htm
[4]: http://en.wiktionary.org/wiki/Wiktionary:Hindi_transliteration
--
Regards
Dhiraj Lohiya
IRC nick: Dhiraj
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff