Subject: mueller7accent-dict: Phonetic Transcription to display correct UTF8 for mueller dict packages
Package: mueller7accent-dict
Version: 2002.02.27-3.2
Severity: normal
Tags: patch
This could also be considered an extension of bug #92351 and is probably
just a wishlist item.
The phonetic transcriptions of some words do not display correctly,
e.g.:
dog
[d█g]
1. _n.
1: собака, пёс; Greater (lesser) Dog ..
Should display:
dog
[dɔg]
1. _n.
1: собака, пёс; Greater (lesser) Dog ..
In fact, some transcriptions are even deleted by the to-dict.sh script,
e.g.:
bread
1. _n.
1: хлеб; _перен. кусок хлеба ....
Keeping the phonetic transcription (by simply removing the --no-trans
step from the debian/rules file) would instead display:
bread
[bred]
1. _n.
1: хлеб; _перен. кусок хлеба ....
Included is a patch against the Debian-patched source.
It adds a perl script that fixes the phonetic transcriptions by
rewriting the incorrect UTF8 bytes as their (hopefully) correct
IPA UTF8 counterparts.
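To illustrate the byte rewriting, here is a rough shell sketch. It is not
the script from the patch: fix_trans is a hypothetical name, it uses sed
instead of perl, and it covers only two of the mappings (byte values taken
from upgrade_trans.pl). Like the perl script, it only touches text inside
[...].

```shell
#!/bin/sh
# Sketch only: fix_trans is a hypothetical helper, not part of the patch.
# It rewrites two mis-recoded byte sequences inside [...] to IPA, using
# byte values taken from upgrade_trans.pl.
fix_trans() {
    block=$(printf '\342\226\210')   # E2 96 88, appears where the open o belongs
    open_o=$(printf '\311\224')      # C9 94, U+0254
    corner=$(printf '\342\225\232')  # E2 95 9A, appears where the schwa belongs
    schwa=$(printf '\311\231')       # C9 99, U+0259
    # LC_ALL=C makes sed work on bytes; each loop replaces one occurrence
    # inside a bracketed span and repeats until none is left.
    LC_ALL=C sed \
        -e ':a' -e "s/\(\[[^]]*\)$block\([^]]*\]\)/\1$open_o\2/" -e 'ta' \
        -e ':b' -e "s/\(\[[^]]*\)$corner\([^]]*\]\)/\1$schwa\2/" -e 'tb'
}

# Feed the raw bytes of the "dog" transcription through the filter.
printf '[d\342\226\210g]\n' | fix_trans
```

On a UTF8 terminal the last line should print [dɔg]. Bytes outside the
brackets are left alone, so the Cyrillic definitions pass through unchanged.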
This is more of a "completeness" issue, so that transcriptions
are actually displayed correctly.
I am not sure whether keeping the transcriptions will break any required
dict format. There is also the additional requirement of needing perl
to build successfully. It should work with any perl version 5.6 and up.
Hopefully someone will find this useful.
Thanks
Chris Donoghue
diff -Naur mueller.orig/debian/rules mueller/debian/rules
--- mueller.orig/debian/rules 2005-03-18 20:32:37.000000000 +1100
+++ mueller/debian/rules 2005-03-22 12:19:50.000000000 +1100
@@ -21,17 +21,22 @@
# patch does not set executable flag
chmod a+x debian/scripts/to-dict.sh
+ chmod a+x debian/scripts/upgrade_trans.pl
- debian/scripts/to-dict.sh --no-trans Mueller7accentGPL.koi mueller7accent.notr
- debian/scripts/to-dict.sh --src-data mueller7accent.notr mueller7accent.data
+ # Keep the phonetic transcription. Most stayed anyway, so let's just keep them all. The phonetic transcription is upgraded to correct UTF8 encoding in the to-dict.sh using perl program
+ # debian/scripts/to-dict.sh --no-trans Mueller7accentGPL.koi mueller7accent.notr
+ # debian/scripts/to-dict.sh --src-data mueller7accent.notr mueller7accent.data
+ debian/scripts/to-dict.sh --src-data Mueller7accentGPL.koi mueller7accent.data
debian/scripts/to-dict.sh --data-dict mueller7accent.data mueller7accent
-rm -f mueller7.data mueller7.notr
debian/scripts/to-dict.sh --expand-index mueller7accent.index mueller7accent.index.exp
sort -k 1,1 mueller7accent.index.exp > mueller7accent.index
-rm -f mueller7accent.index.exp
- debian/scripts/to-dict.sh --no-trans Mueller7GPL.koi mueller7.notr
- debian/scripts/to-dict.sh --src-data mueller7.notr mueller7.data
+ # Keep the phonetic transcription. Most stayed anyway, so let's just keep them all. The phonetic transcription is upgraded to correct UTF8 encoding in the to-dict.sh using perl program
+ # debian/scripts/to-dict.sh --no-trans Mueller7GPL.koi mueller7.notr
+ # debian/scripts/to-dict.sh --src-data mueller7.notr mueller7.data
+ debian/scripts/to-dict.sh --src-data Mueller7GPL.koi mueller7.data
debian/scripts/to-dict.sh --data-dict mueller7.data mueller7
-rm -f mueller7.data mueller7.notr
debian/scripts/to-dict.sh --expand-index mueller7.index mueller7.index.exp
diff -Naur mueller.orig/debian/scripts/to-dict.sh mueller/debian/scripts/to-dict.sh
--- mueller.orig/debian/scripts/to-dict.sh 2005-03-18 20:32:37.000000000 +1100
+++ mueller/debian/scripts/to-dict.sh 2005-03-22 12:19:50.000000000 +1100
@@ -13,6 +13,9 @@
DICTFMT=`which dictfmt`
DICTZIP=`which dictzip`
+# and upgrade phonetics transcription perl script
+UPGTRANS=`dirname $0`/upgrade_trans.pl
+
INFO () {
echo "
to-dict, version $version ($versiondate).
@@ -166,6 +169,7 @@
# -s "$TITLE" $3 < $2 || exit 1
recode -f KOI8-RU..UTF-8 < $2 |\
+ LC_ALL=C $UPGTRANS |\
LOCPATH=locale dictfmt -p --allchars --locale ru_RU.utf-8\
-u "http://www.chat.ru/~mueller_dic" -s "$TITLE" $3
diff -Naur mueller.orig/debian/scripts/upgrade_trans.pl mueller/debian/scripts/upgrade_trans.pl
--- mueller.orig/debian/scripts/upgrade_trans.pl 1970-01-01 10:00:00.000000000 +1000
+++ mueller/debian/scripts/upgrade_trans.pl 2005-03-22 14:31:39.000000000 +1100
@@ -0,0 +1,34 @@
+#!/usr/bin/perl
+
+while(<STDIN>)
+{
+ $linemod=$_;
+ $linemod=~s/\[(.*?)\]/&pronmod($&)/eg;
+ print $linemod;
+
+}
+
+sub pronmod
+{
+ $phword=$_[0]; $word=$phword;
+ $chf=chr(0x51); $cht=chr(0xc3).chr(0xa6); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x41); $cht=chr(0xc9).chr(0x91); $phword=~s/$chf/$cht/g;
+ $chf=chr(0xd0).chr(0xab); $cht=chr(0xcb).chr(0x90); $phword=~s/$chf/$cht/g;
+ $chf=chr(0xe2).chr(0x95).chr(0x9a); $cht=chr(0xc9).chr(0x99); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x45); $cht=chr(0xc9).chr(0x9b); $phword=~s/$chf/$cht/g;
+ $chf=chr(0xe2).chr(0x96).chr(0x88); $cht=chr(0xc9).chr(0x94); $phword=~s/$chf/$cht/g;
+ $chf=chr(0xd1).chr(0x86); $cht=chr(0xca).chr(0x8c); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x49); $cht=chr(0x69); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x69); $cht=chr(0x69); $phword=~s/$chf/$cht/g;
+ $chf=chr(0xd1).chr(0x85); $cht=chr(0xcb).chr(0x88); $phword=~s/$chf/$cht/g;
+ $chf=chr(0xd0).chr(0xb3); $cht=chr(0xcb).chr(0x8c); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x5a); $cht=chr(0xca).chr(0x92); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x4e); $cht=chr(0xc5).chr(0x8b); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x53); $cht=chr(0xca).chr(0x83); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x44); $cht=chr(0xc3).chr(0xb0); $phword=~s/$chf/$cht/g;
+ $chf=chr(0x54); $cht=chr(0xce).chr(0xb8); $phword=~s/$chf/$cht/g;
+
+ return $phword;
+}
+
+