Re: Errors in Unihan data : simplified/traditional variants

2010-11-01 Thread John H. Jenkins

On 2010/10/30, at 8:42 PM, Koxinga wrote:

 My quickly done parsing program counted 1154 such pairs, where the head 
 character was the same as the character above. It seems to be always in the 
 order kTraditionalVariant then kSimplifiedVariant, so it can maybe be 
 automatically corrected. It seems to be a very evident mistake, and the 
 correction should be easy. I can help with that, I am just waiting to see if 
 this is the right place to report problems in Unihan. I also 
 considered http://www.unicode.org/reporting.html , would it be better ?
 

Yes, that would be better.  That way it will be tracked and it's less likely to 
slip through the cracks in my schedule.  For general questions, you can email 
me directly.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Errors in Unihan data : simplified/traditional variants

2010-10-31 Thread Koxinga

Hello,

I recently looked up the traditional-simplified relationships in the 
Unihan database (Unihan_Variants.txt).


I knew it had mistakes and I wanted to help correct some of them, but 
the first thing that stood out and surprised me was the large number of 
lines like:


U+346F  kSimplifiedVariant  U+3454
U+346F  kTraditionalVariant U+3454

which should be (if I didn't mix them up ...)

U+3454  kTraditionalVariant  U+346F
U+346F  kSimplifiedVariant U+3454

My quickly done parsing program counted 1154 such pairs, where the head 
character was the same as the character above. It seems to be always in 
the order kTraditionalVariant then kSimplifiedVariant, so it can maybe 
be automatically corrected. It seems to be a very evident mistake, and 
the correction should be easy. I can help with that, I am just waiting 
to see if this is the right place to report problems in Unihan. I also 
considered http://www.unicode.org/reporting.html , would it be better ?
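
A minimal sketch of the kind of check described above, not the actual parsing program. It assumes whitespace-separated lines of the form `U+XXXX  field  U+YYYY ...` with `#` comment lines, and the function name `find_suspect_pairs` is made up for illustration. It only lists candidates for review: code points that name the same target as both kSimplifiedVariant and kTraditionalVariant.

```python
from collections import defaultdict

VARIANT_FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")

def find_suspect_pairs(lines):
    """Return sorted (source, target) pairs where one source code point
    lists the same target under both variant fields."""
    fields_seen = defaultdict(set)  # (source, target) -> field names seen
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed format)
        parts = line.split()
        if len(parts) < 3 or parts[1] not in VARIANT_FIELDS:
            continue
        source, field = parts[0], parts[1]
        for target in parts[2:]:
            fields_seen[(source, target)].add(field)
    return sorted(pair for pair, fields in fields_seen.items()
                  if len(fields) == len(VARIANT_FIELDS))

sample = [
    "U+346F  kSimplifiedVariant  U+3454",
    "U+346F  kTraditionalVariant U+3454",
]
print(find_suspect_pairs(sample))
```

Deciding which of the two lines is wrong, and in which direction to fix it, is left to a human reviewer.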


I have a lot of other questions and comments on these 
simplified/traditional relationships, but I guess they can wait for the 
resolution of this problem; including them here would make this email too long.


Regards,

Koxinga






Errors in Unihan?

2000-11-14 Thread Pierpaolo Bernardi


Hello,

In the Unihan.txt database, in the kMandarin field there are entries
with duplicate pronunciations. For example:

U+4E21  kMandarin   1 LIANG3 2 LIANG3 3 LIANG4
U+4E4E  kMandarin   1 HU1 HU2 2 HU1
U+4E86  kMandarin   1 LIAO3 2 LE LIAO3

Is there a reason for these duplicates? If this is the case, the
format of this field should be documented better in the header. If
these duplications are errors, I can supply a list of them.

Also, what's the meaning of the isolated numbers?



Other entries certainly contain errors, for example:

U+5594  kMandarin   1 WO1 2 01
                            ^ this is zero.

U+4EC0  kMandarin   1 SHI2 2 SHEN2 3 SHI2 SHIU2SHEN2 SHI2
   ?? -- shi2 shen2 ??
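
Both kinds of problem could be listed mechanically. The sketch below is only an illustration. It assumes a kMandarin value is a space-separated mix of isolated numbers and syllables written as uppercase letters with an optional tone digit 1-5; that shape is inferred from the examples above, not taken from any documentation. The name `check_kmandarin` is hypothetical.

```python
import re

# Assumed shapes, inferred from the examples in this message.
SENSE_NUMBER = re.compile(r"[1-9][0-9]*")
SYLLABLE = re.compile(r"[A-ZÜ]+[1-5]?")

def check_kmandarin(value):
    """Return (duplicates, malformed) for one kMandarin field value."""
    seen, duplicates, malformed = set(), [], []
    for token in value.split():
        if SENSE_NUMBER.fullmatch(token):
            continue  # isolated number, not a pronunciation
        if not SYLLABLE.fullmatch(token):
            malformed.append(token)  # e.g. "01" or two fused syllables
        elif token in seen:
            if token not in duplicates:
                duplicates.append(token)
        else:
            seen.add(token)
    return duplicates, malformed

for value in ("1 LIANG3 2 LIANG3 3 LIANG4", "1 WO1 2 01"):
    print(value, "->", check_kmandarin(value))
```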

Regards,
  Pierpaolo Bernardi



Re: Errors in Unihan?

2000-11-14 Thread John Jenkins


On Tuesday, November 14, 2000, at 08:24 AM, Pierpaolo Bernardi wrote:

 In the Unihan.txt database, in the kMandarin field there are entries
 with duplicate pronunciations. For example:
 
 U+4E21  kMandarin   1 LIANG3 2 LIANG3 3 LIANG4
 U+4E4E  kMandarin   1 HU1 HU2 2 HU1
 U+4E86  kMandarin   1 LIAO3 2 LE LIAO3
 
 Is there a reason for these duplicates? If this is the case, the
 format of this field should be documented better in the header. If
 these duplications are errors, I can supply a list of them.
 

That would be very helpful, yes.  

 Also, what's the meaning of the isolated numbers?
 

The value of the field was obtained from dictionaries.  When a dictionary provides 
more than one meaning, it is not uncommon for one pronunciation to be specific to a 
particular meaning and another pronunciation to another.  This is where the 
numbers come from.

Inasmuch as the database doesn't maintain the link between specific definitions and 
pronunciations, the isolated numbers should also be removed.
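
As a rough illustration of that cleanup, and not the procedure actually used on the database, the sketch below drops the isolated numbers and keeps the first occurrence of each pronunciation. The function name `clean_kmandarin` and the treatment of the field as space-separated tokens are assumptions. Tokens such as `01` are deliberately left in place so that they can be corrected by hand.

```python
import re

# Assumption: an isolated number is a run of digits with no leading zero.
SENSE_NUMBER = re.compile(r"[1-9][0-9]*")

def clean_kmandarin(value):
    """Drop isolated numbers and repeated pronunciations, keeping order."""
    kept = []
    for token in value.split():
        if SENSE_NUMBER.fullmatch(token):
            continue  # the link to a specific definition is not kept
        if token not in kept:
            kept.append(token)
    return " ".join(kept)

print(clean_kmandarin("1 LIAO3 2 LE LIAO3"))
```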