Re: [Moses-support] Preparing TMX files for use in Moses

Tom Hoar Sun, 13 Mar 2016 05:44:55 -0700

I don't use truecase myself, but it works much like recase, so I'd start 
there. Recase starts by preparing a monolingual corpus in the target language.
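
As far as I know, the truecase-model file is not something you download: you train it yourself from a tokenized monolingual corpus, and Moses ships a train-truecaser.perl script under scripts/recaser for that. To show the idea only, here is a much simplified sketch in Python. It is not the Moses implementation, and the function names train_truecase_model and truecase are made up for this example. It assumes whitespace-tokenized input and only restores the case of the first word of a sentence.

```python
from collections import Counter, defaultdict


def train_truecase_model(sentences):
    """Learn the most frequent casing of each word (hypothetical helper).

    Sentence-initial tokens are skipped because their capital letter
    says nothing about the word's usual form.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep the single most frequent surface form per lowercased word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, model):
    """Replace the first token with its learned form; unknown words stay as-is."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


if __name__ == "__main__":
    corpus = [
        "Danes je Ljubljana lepa",
        "V mestu Ljubljana je sonce",
    ]
    model = train_truecase_model(corpus)
    print(truecase("Je to Ljubljana", model))
```

The larger and cleaner the monolingual Slovenian corpus, the more reliable the learned casing will be.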



On March 13, 2016 6:24:48 PM GMT+07:00, "Sašo Kuntaric" 
<[email protected]> wrote:
>Thank you for your reply.
>
>It's one of those mistakes that is hard to admit because it is so
>trivial: I mistyped the language name (EN-US instead of en-US), since
>I am mostly a Windows user. The script works fine now and I can
>confirm it works well with Studio-exported TMX files.
>
>I do have another question regarding the training of the truecaser. In
>the example shown on the Moses homepage, a truecase-model.en file is
>used, but it is downloaded with the example files. If I want to train
>a truecaser for Slovenian, how do I get the truecase-model file? Is it
>something I need to create myself, and if so, how do I go about it?
>
>Thanks in advance for the replies.
>
>Best regards,
>
>Sašo
>
>2016-03-13 12:03 GMT+01:00 Tom Hoar
><[email protected]>:
>
>> I don't know the tmx2txt.pl script, but I can suggest where to look
>> for problems.
>>
>> The most frequent problem we have when extracting data from TMX files
>> comes from files that don't comply with the TMX specification,
>> especially in how they use the srclang attribute. The spec states this
>> about how to identify the source language:
>>
>> "*the <tuv> holding the source segment will have its xml:lang attribute
>> set to the same value as srclang. (except if srclang is set to "*all*").
>> If a <tu> element does not have a srclang attribute specified, it uses
>> the one defined in the <header> element.*"
>>
>> Sadly, many TMX creation tools, including tools from SDL, do not properly
>> identify the source language. Each tool that looks for the source language
>> TUV according to the spec handles erroneous TMX segments in its own way.
>> So, you need to learn how your TMX declares the srclang attribute, and then
>> study the script to see where there's a mismatch.
>>
>> You can see how we managed these sloppy TMX files in this post, only a
>> week old: https://pttools.freshdesk.com/discussions/topics/6000034251
>>
>> Hope this helps.
>>
>> Tom
>>
>>
>> On 3/12/2016 8:57 PM, [email protected] wrote:
>>
>> Date: Sat, 12 Mar 2016 13:42:05 +0100
>> From: Sašo Kuntaric <[email protected]>
>> Subject: [Moses-support] Preparing TMX files for use in Moses
>> To: [email protected]
>>
>> Hi all,
>>
>> I have a question that is not connected directly to Moses. I am trying to
>> prepare the corpora for training my engine. I have exported a few of my TMs
>> to the TMX format and now I am trying to create two separate UTF-8 text
>> files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I
>> get empty text files for both (the former tool claims that the input file
>> can't be read). Are there any special settings I need to set when extracting
>> the TMX files? I am using SDL Trados Studio 2015 for exporting the files.
>>
>> Has anyone come across anything like this?
>>
>> --
>> lp,
>>
>> Sašo
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
>-- 
>lp,
>
>Sašo
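
On the EN-US versus en-US mismatch quoted above: one way to make an extraction script tolerant of it is to compare language codes case-insensitively. The following is a minimal sketch in Python, not the logic of tmx2txt.pl or extract-tmx-corpus. The function name extract_pairs and the fallback to a plain lang attribute (for older TMX files) are my own assumptions, and inline tags inside <seg> are simply dropped while their surrounding text is kept.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from TMX content.

    Language codes are lowercased before comparison, so "EN-US"
    and "en-US" are treated as the same language.
    """
    root = ET.fromstring(tmx_text)
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Assumption: fall back to a bare "lang" attribute if xml:lang is absent.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() keeps the text around inline tags such as <bpt>.
                segments[lang] = "".join(seg.itertext()).strip()
        # Skip translation units that lack either language.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Writing the first element of each pair to one UTF-8 file and the second to another, one segment per line, gives the two parallel text files that Moses training expects.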
