I don't use truecase myself, but it works much like recase, so I'd start there. Training the recaser starts with preparing a monolingual corpus in the target language.
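To give a rough idea of what a truecase model holds, here is a minimal Python sketch of the general technique: count how each word is cased when it is not the first word of a sentence, then use the most frequent form to fix sentence-initial words. This is an illustration under my own assumptions, not the Moses implementation; the function names and the tiny Slovenian corpus are made up for the example. As far as I know, Moses ships its own training script (train-truecaser.perl) for building the real model file from a tokenized monolingual corpus.

```python
from collections import Counter, defaultdict


def train_truecase_model(sentences):
    """Count how each word is cased, skipping sentence-initial tokens.

    Hypothetical helper: `sentences` is an iterable of tokenized,
    space-separated lines from a monolingual corpus.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # The first token is capitalized regardless of its true case,
        # so it carries no evidence and is skipped.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep the most frequent surface form for every lowercased word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the sentence-initial token; leave the rest untouched."""
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        # Words the model has never seen are left exactly as they are.
        tokens[0] = model.get(first.lower(), first)
    return " ".join(tokens)


# Made-up, pre-tokenized sample corpus (assumption for illustration only).
corpus = [
    "Danes je v Ljubljani lepo vreme .",
    "Ljubljana je glavno mesto Slovenije .",
    "V mestu je danes veliko ljudi .",
]
model = train_truecase_model(corpus)
print(truecase("Danes je lepo .", model))
print(truecase("Ljubljana je lepa .", model))
```

A real corpus needs to be far larger than three lines, otherwise most sentence-initial words will simply be unknown to the model and stay unchanged.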
On March 13, 2016 6:24:48 PM GMT+07:00, "Sašo Kuntaric" <[email protected]> wrote:
>Thank you for your reply.
>
>It's one of those errors it's hard to admit one's mistake for, because
>it's so trivial, namely I mistyped the language name (EN-US instead of
>en-US), since I am mostly a Windows user. The script works fine now and
>I can confirm it works well with Studio-exported TMX files.
>
>I do have another question regarding the training of the truecaser. In
>the example shown on the Moses homepage, a truecase-model.en file is
>used, however it is downloaded with the example files. If I want to
>train my truecaser for Slovenian, how do I get the truecase-model file.
>Is it something I need to create myself and how do I go about and do it?
>
>Thanks in advance for the replies.
>
>Best regards,
>
>Sašo
>
>2016-03-13 12:03 GMT+01:00 Tom Hoar <[email protected]>:
>
>> I don't know the tmx2txt.pl script, but I can suggest where to look
>> for problems.
>>
>> The most frequent problem we have when extracting data from TMX files
>> comes from files that don't comply with the TMX specification,
>> especially regarding compliance with the srclang attributes. The spec
>> states this about how to identify the source language:
>>
>> "*the <tuv> holding the source segment will have its xml:lang
>> attribute set to the same value as srclang. (except if srclang is set
>> to "*all*"). If a <tu> element does not have a srclang attribute
>> specified, it uses the one defined in the <header> element.*"
>>
>> Sadly, many TMX creation tools, including tools from SDL, do not
>> properly identify the source language. Each tool that looks for the
>> source language TUV according to the spec handles erroneous TMX
>> segments in its own way. So, you need to learn how your TMX declares
>> the srclang attribute, and then study the script to see where there's
>> a mismatch.
>>
>> You can see how we managed these sloppy TMX files in this post, only
>> a week old: https://pttools.freshdesk.com/discussions/topics/6000034251
>>
>> Hope this helps.
>>
>> Tom
>>
>>
>> On 3/12/2016 8:57 PM, [email protected] wrote:
>>
>> Date: Sat, 12 Mar 2016 13:42:05 +0100
>> From: Sašo Kuntaric <[email protected]>
>> Subject: [Moses-support] Preparing TMX files for use in Moses
>> To: [email protected]
>>
>> Hi all,
>>
>> I have a question that is not connected directly to Moses. I am
>> trying to prepare the corpora for training my engine. I have exported
>> a few of my TMs to the TMX format and now I am trying to create two
>> separate UTF-8 text files. I have tried it with the
>> extract-tmx-corpus and tmx2txt.pl tools. I get empty text files for
>> both (the former tool claims that the input file can't be read). Are
>> there any special setting I need to set when extracting the TMX
>> files? I am using SDL Trados Studio 2015 for exporting the files.
>>
>> Has anyone come across anything like this?
>>
>> --
>> lp,
>>
>> Sašo
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
>--
>lp,
>
>Sašo
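For anyone hitting the same srclang mismatch described in the quoted messages (EN-US in one place, en-US in another), here is a minimal Python sketch of a tolerant extractor. It follows the rule quoted from the spec (a <tu> without srclang falls back to the <header> value) but compares language codes case-insensitively. It is not the tmx2txt.pl or extract-tmx-corpus logic; the function name, the sample TMX and the choice to ignore srclang="*all*" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, target_lang):
    """Return (source, target) segment pairs from a TMX string.

    Hypothetical helper: language codes are compared case-insensitively,
    and srclang="*all*" is not handled.
    """
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    default_src = header.get("srclang", "") if header is not None else ""
    pairs = []
    for tu in root.iter("tu"):
        # A <tu> without its own srclang uses the one from <header>.
        src_lang = tu.get("srclang", default_src).lower()
        source = target = None
        for tuv in tu.findall("tuv"):
            # Older files may use a plain lang attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is None:
                continue
            # itertext() also collects text nested inside inline tags.
            text = "".join(seg.itertext()).strip()
            if lang == src_lang:
                source = text
            elif lang == target_lang.lower():
                target = text
        if source is not None and target is not None:
            pairs.append((source, target))
    return pairs


# Made-up sample: the header says EN-US, the segment says en-US.
SAMPLE = """<tmx version="1.4">
  <header srclang="EN-US" datatype="plaintext" segtype="sentence"
          adminlang="en-US" o-tmf="demo" creationtool="demo"
          creationtoolversion="1"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="sl-SI"><seg>Dobro jutro.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = extract_pairs(SAMPLE, "sl-SI")
print(pairs)
```

Writing the first element of each pair to one UTF-8 file and the second to another, one segment per line, gives the two parallel text files needed for training.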
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
