Hi,

I had some problems with TMX extraction scripts and wrote my own. You might find it useful:

https://github.com/havet/TMX2Moses

It simply disregards the specification in the header and reads the source and target languages from the <tu> elements. It works on single TMX files as well as on folders containing TMX files.

Yours,
Per Tunedal

On Sun, Mar 13, 2016, at 12:03, Tom Hoar wrote:
> I don't know the tmx2txt.pl script, but I can suggest where to look for problems.
>
> The most frequent problem we have when extracting data from TMX files comes from files that don't comply with the TMX specification, especially regarding the srclang attributes. The spec states this about how to identify the source language:
>
>> "the <tuv> holding the source segment will have its xml:lang attribute set to the same value as srclang. (except if srclang is set to "*all*"). If a <tu> element does not have a srclang attribute specified, it uses the one defined in the <header> element."
>
> Sadly, many TMX creation tools, including tools from SDL, do not properly identify the source language. Each tool that looks for the source language TUV according to the spec handles erroneous TMX segments in its own way. So, you need to learn how your TMX declares the srclang attribute, and then study the script to see where there's a mismatch.
>
> You can see how we managed these sloppy TMX files in this post, only a week old: https://pttools.freshdesk.com/discussions/topics/6000034251
>
> Hope this helps.
>
> Tom
>
> On 3/12/2016 8:57 PM, [email protected] wrote:
>> Date: Sat, 12 Mar 2016 13:42:05 +0100
>> From: Sašo Kuntaric <[email protected]>
>> Subject: [Moses-support] Preparing TMX files for use in Moses
>> To: [email protected]
>>
>> Hi all,
>>
>> I have a question that is not directly connected to Moses. I am trying to prepare the corpora for training my engine. I have exported a few of my TMs to the TMX format and now I am trying to create two separate UTF-8 text files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I get empty text files with both (the former tool claims that the input file can't be read). Are there any special settings I need to apply when extracting the TMX files? I am using SDL Trados Studio 2015 for exporting the files. Has anyone come across anything like this?
>>
>> --
>> Best regards,
>> Sašo
>
> _________________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
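
For readers who want to see the idea in code: below is a minimal Python sketch of the approach described at the top of this message, i.e. ignoring the srclang declared in the header and picking segments by the language tag found on each <tuv>. It is not the TMX2Moses code. The names extract_pairs and write_corpus are made up for illustration, and matching on the primary language subtag only (so "en" also matches "en-US") is an assumption that may not suit every export. Inline markup inside <seg> is simply dropped.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _seg_text(tuv):
    """Return the plain text of a <tuv>'s <seg>, dropping inline tags.

    Simplification: only the text directly inside <seg> and the text
    following each inline element is kept; content nested inside inline
    elements (formatting codes and the like) is discarded.
    """
    seg = tuv.find("seg")
    if seg is None:
        return ""
    parts = [seg.text or ""] + [child.tail or "" for child in seg]
    # Collapse line breaks and runs of spaces so each segment is one line.
    return " ".join("".join(parts).split())


def extract_pairs(tmx_source, src, tgt):
    """Collect (source, target) segment pairs from a TMX file.

    tmx_source is a path or an open binary file. The header's srclang is
    never consulted; languages are read from each <tuv> (xml:lang, or the
    older plain lang attribute) and compared on the primary subtag.
    """
    pairs = []
    for tu in ET.parse(tmx_source).getroot().iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            by_lang[lang.lower().split("-")[0]] = _seg_text(tuv)
        s, t = by_lang.get(src.lower()), by_lang.get(tgt.lower())
        if s and t:  # skip units where either side is missing or empty
            pairs.append((s, t))
    return pairs


def write_corpus(pairs, src_path, tgt_path):
    """Write the pairs as two line-aligned UTF-8 text files."""
    with open(src_path, "w", encoding="utf-8") as fs, \
         open(tgt_path, "w", encoding="utf-8") as ft:
        for s, t in pairs:
            fs.write(s + "\n")
            ft.write(t + "\n")
```

With this sketch, write_corpus(extract_pairs("export.tmx", "en", "sl"), "corpus.en", "corpus.sl") would produce two line-aligned files; the file names and language codes here are placeholders.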
