Re: [Moses-support] EMS and data preprocessing

Philipp Koehn Tue, 25 May 2010 06:11:07 -0700

Hi,

Lately I have been using truecasing instead of lowercasing for European
languages, since it handles unknown names better. There are also
some gains, for instance for German, where uppercased nouns are handled
differently from lowercased verbs with the same spelling.
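
To make the idea concrete, here is a minimal sketch of truecasing in
Python. It is a simplified illustration, not the algorithm implemented in
train-truecaser.perl or truecase.perl: the function names, the whitespace
tokenization, and the rule of skipping only the first token of each
sentence are assumptions made for this example. The model records the most
frequent casing of each word as seen away from the sentence start, and
only the sentence-initial token is rewritten, so names and German nouns
keep their capital letter while ordinary words are lowercased.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the preferred casing of each word (illustrative sketch only)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first token: its capital letter comes from its position
        # in the sentence, not from the word itself.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep the most frequent surface form for each lowercased key.
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the sentence-initial token to its preferred casing."""
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        # Words missing from the model (e.g. unseen names) stay untouched.
        tokens[0] = model.get(first.lower(), first)
    return " ".join(tokens)
```

Unlike lowercasing, this leaves an unseen capitalized name at the start of
a sentence as it is, which is the behaviour described above for unknown
names.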


Try it for yourself, but be sure to evaluate with cased BLEU.

-phi

On Tue, May 25, 2010 at 12:54 PM, Suzy Howlett <[email protected]> wrote:
> Hi,
>
> Ah, I see where I went wrong. I'm used to lowercasing + recasing, and didn't
> realise what the truecaser did. Thanks for the help! Would you recommend
> using lowercase+recase or truecase+detruecase?
>
> Suzy
>
> On 25/05/10 8:47 PM, Philipp Koehn wrote:
>>
>> Hi Suzy,
>>
>> I could reproduce this error, in a way that I assume matches what you did:
>> you changed the specification of the CORPUS, but you did not
>> disable the truecaser.
>>
>> You need to comment out the following settings:
>>
>> [TRUECASER]
>>
>> ### script to train truecaser models
>> #
>> #trainer = $moses-script-dir/recaser/train-truecaser.perl
>>
>> [GENERAL]
>> # truecasers
>> #input-truecaser = $moses-script-dir/recaser/truecase.perl
>> #output-truecaser = $moses-script-dir/recaser/truecase.perl
>> #detruecaser = $moses-script-dir/recaser/detruecase.perl
>>
>> If these are not disabled, the script still thinks that it has to build
>> a truecaser model, and hence needs to find unpreprocessed data.
>>
>> -phi
>>
>>
>> On Tue, May 25, 2010 at 10:47 AM, Suzy Howlett <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to run a system through the EMS where all of the
>>> preprocessing (tokenization, lowercasing) has already been done for all
>>> of the training, tuning and evaluation data. The intermediate files are
>>> not available, so I only provide the final lowercased data. In my
>>> config file I have, e.g.,
>>>
>>> [CORPUS:combined]
>>> lowercased-stem = $wmt10preproc-data/training/lowercased
>>>
>>> where the directory $wmt10preproc-data/training contains two files,
>>> lowercased.de and lowercased.en. The variables raw-stem, tokenized-stem,
>>> clean-stem are not set.
>>>
>>> However when I run the system, it looks like it's still trying to run
>>> the get-corpus/tokenize/clean steps - it produces files like
>>> steps/1/CORPUS_combined_get-corpus.1* which contain error messages about
>>> not being able to find files. What am I missing?
>>>
>>> Thanks,
>>> Suzy
>>>
>>> --
>>> Suzy Howlett
>>> http://web.science.mq.edu.au/~showlett/
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>
> --
> Suzy Howlett
> http://web.science.mq.edu.au/~showlett/
>

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
