Re: [Moses-support] OT: LDC2004E12

Ham, Michael Sun, 13 Jul 2008 19:09:16 -0700

Those escape sequences are Unicode characters written as hexadecimal
code points.  Chinese characters cannot be represented in ASCII, so you
have to use a Unicode encoding such as UTF-8.
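
As a rough illustration of the point above, here is a minimal Python
sketch that turns sequences of the form \x{HHHH} into real characters.
It assumes the hex digits are Unicode code points, as described above;
if they turn out to be bytes in a legacy Chinese encoding instead, a
different conversion would be needed. The function name decode_escapes
is made up for this example and is not part of the corpus tools or of
Moses.

```python
import re

# Matches Perl-style escapes such as \x{a37e} (1 to 6 hex digits).
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]{1,6})\}")


def decode_escapes(text):
    # Hypothetical helper: replace each \x{HHHH} escape with the
    # character at that code point, ASSUMING the digits are Unicode
    # code points.
    def replace(match):
        code_point = int(match.group(1), 16)
        # Leave the escape as-is if it is not an encodable character:
        # beyond the Unicode range, or a lone surrogate.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return ESCAPE.sub(replace, text)


# U+4E2D and U+6587 are the two characters of the word for "Chinese".
sample = decode_escapes(r"UN \x{4e2d}\x{6587} text")
utf8_bytes = sample.encode("utf-8")  # ready to write to a UTF-8 file
```

The decoded text then needs to be written out as UTF-8 (for example by
opening the output file with encoding="utf-8") so that later tools see
the actual characters instead of the escapes.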


In addition, you need to install a font that can display Chinese
characters.  One that has worked for me is the Bitstream Cyberbit font.
You can download it here:
http://http.netscape.com.edgesuite.net/pub/communicator/extras/fonts/windows/Cyberbit.ZIP

I hope this helps!
- Michael

------------------------------

Date: Fri, 11 Jul 2008 15:39:11 -0400
From: "John D. Burger" <[EMAIL PROTECTED]>
Subject: [Moses-support] OT: LDC2004E12
To: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

Sorry for the slightly off-topic message, but at least it's about MT:

We're using the UN Chinese-English Parallel Text collection  
(LDC2004E12) for some of our work.  It has lots of odd sequences of  
the form:

   \x{a37e}

I presume these are hex codes indicating escaped characters or  
something, but I'm not sure what.  Has anyone done anything with  
these, other than ignore or delete them?

Thanks.

- John Burger
   MITRE


------------------------------

Message: 2
Date: Sat, 12 Jul 2008 10:16:21 +0000 (UTC)
From: Vineet Kashyap <[EMAIL PROTECTED]>
Subject: [Moses-support] Unknown words
To: [email protected]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=us-ascii

Hi all

1. Is there a way to output unknown words to a separate
file instead of dropping them? I think we could add
those words to the dictionary, which would improve
accuracy.

2. Also, when adding a dictionary to the parallel corpus, as
suggested by Phillip in the previous post, you have one
word in the source language and the other in the target
language. Is that correct?

3. Does BLEU use a reference file with accurate human
translations to estimate a score? If not, would it
be better to evaluate the system with such a reference file?

4. What BLEU value (as a percentage) indicates good translations?
   And, for comparison purposes, how would a human judge an MT
   system's performance?

5. Can we train higher-order language models with SRILM on
a small corpus, or do we have to use IRSTLM?


Thanks a lot in advance for taking the time in answering these
questions.

Regards, Vineet



------------------------------

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 21, Issue 7
********************************************
