Those escape sequences are Unicode characters written as hexadecimal codes. Chinese characters do not exist in ASCII, so you have to use UTF-8.
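As a rough sketch of what that conversion could look like, the Python snippet below replaces each \x{HHHH} sequence with the character at that code point. It assumes the hex values really are Unicode code points, which has not been checked against LDC2004E12 itself; if they turn out to be codes from another encoding, the mapping step would need to change. The function name decode_escapes is made up for this example.

```python
import re

# Matches Perl-style hex escapes such as \x{a37e}
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")


def decode_escapes(text):
    """Replace each \\x{HHHH} with the character at that code point.

    Assumption: the hex value is a Unicode code point. Values that are
    not valid for UTF-8 output (surrogates, or anything above U+10FFFF)
    are left exactly as they appeared in the input.
    """
    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return ESCAPE.sub(replace, text)


if __name__ == "__main__":
    sample = r"abc \x{4e2d}\x{6587} def"
    print(decode_escapes(sample))
```

When writing the result to a file, open it with encoding="utf-8" so the decoded characters are stored as UTF-8.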
However, in addition to doing this, you also need to install a font that can display Chinese characters. One that I have gotten to work, and that you may want to look into, is the Bitstream Cyberbit font. You can download it here:

http://http.netscape.com.edgesuite.net/pub/communicator/extras/fonts/windows/Cyberbit.ZIP

I hope this helps!

- Michael

------------------------------

Date: Fri, 11 Jul 2008 15:39:11 -0400
From: "John D. Burger" <[EMAIL PROTECTED]>
Subject: [Moses-support] OT: LDC2004E12
To: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

Sorry for the slightly off-topic message, but at least it's about MT:

We're using the UN Chinese-English Parallel Text collection (LDC2004E12) for some of our work. It has lots of odd sequences of the form:

\x{a37e}

I presume these are hex codes indicating escaped characters or something, but I'm not sure what. Has anyone done anything with these, other than ignore or delete them?

Thanks.

- John Burger
  MITRE

------------------------------

Message: 2
Date: Sat, 12 Jul 2008 10:16:21 +0000 (UTC)
From: Vineet Kashyap <[EMAIL PROTECTED]>
Subject: [Moses-support] Unknown words
To: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=us-ascii

Hi all

1. Is there a way to output unknown words to a separate file instead of dropping them? I think we could add those words to the dictionary, which would improve accuracy.

2. Also, when adding a dictionary to the parallel corpus, as suggested by Phillip in the previous post, you have one word in the source language and the other in the target language. Is that correct?

3. Does BLEU use a reference file with accurate human translations to estimate a score? If not, would it be better to evaluate the system with such a reference file?

4. What BLEU value (as a percentage) indicates good translations? And, for comparison purposes, how would a human judge an MT system's performance?

5. Can we train higher-order language models with SRILM on a small corpus, or do we have to use IRSTLM?

Thanks a lot in advance for taking the time to answer these questions.

Regards,
Vineet

------------------------------

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 21, Issue 7
********************************************
