I don't know the tmx2txt.pl script, but I can suggest where to look for
problems.
The most frequent problem we see when extracting data from TMX files
comes from files that don't comply with the TMX specification,
especially in how they use the srclang attribute. The spec states this
about how to identify the source language:
"the <tuv> holding the source segment will have its xml:lang
attribute set to the same value as srclang. (except if srclang is
set to "*all*"). If a <tu> element does not have a srclang attribute
specified, it uses the one defined in the <header> element."
Sadly, many TMX creation tools, including tools from SDL, do not
properly identify the source language. Each tool that looks for the
source language TUV according to the spec handles erroneous TMX segments
in its own way. So, you need to learn how your TMX declares the srclang
attribute and which xml:lang values its <tuv> elements carry, and then
study the script to see where there's a mismatch.
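If it helps, here is a minimal sketch of that first check, written in
Python rather than Perl. It is not taken from tmx2txt.pl or any Moses
tool; the function name audit_srclang and the category labels are made
up for illustration. It assumes a TMX file without XML namespaces on its
elements and counts, per <tu>, whether any <tuv> carries the language
that srclang (from the <tu>, or else from the <header>) points to.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def audit_srclang(tmx_source):
    """Return (header srclang, Counter of per-<tu> findings).

    tmx_source is a file name or a binary file object.
    """
    root = ET.parse(tmx_source).getroot()
    header = root.find("header")
    header_srclang = header.get("srclang") if header is not None else None

    report = Counter()
    for tu in root.iter("tu"):
        # A <tu> without its own srclang falls back to the header value.
        srclang = tu.get("srclang") or header_srclang
        # Assumption: older exports may use a plain "lang" attribute.
        langs = [tuv.get(XML_LANG) or tuv.get("lang") or ""
                 for tuv in tu.findall("tuv")]

        if srclang is None:
            report["no srclang"] += 1
        elif srclang == "*all*":
            report["srclang is *all*"] += 1
        elif srclang in langs:
            report["exact match"] += 1
        elif srclang.lower() in (lang.lower() for lang in langs):
            # e.g. srclang="en-US" but xml:lang="EN-us"
            report["case mismatch"] += 1
        else:
            report["no matching tuv"] += 1
    return header_srclang, report
```

Calling audit_srclang("export.tmx") (the file name is a placeholder) and
printing the result shows at a glance whether most segments land in
"exact match". Large counts in the other buckets suggest that a script
doing a strict string comparison against srclang would skip those
segments.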
You can see how we managed these sloppy TMX files in this post, only a
week old: https://pttools.freshdesk.com/discussions/topics/6000034251
Hope this helps.
Tom
On 3/12/2016 8:57 PM, [email protected] wrote:
Date: Sat, 12 Mar 2016 13:42:05 +0100
From: Sašo Kuntaric <[email protected]>
Subject: [Moses-support] Preparing TMX files for use in Moses
To: [email protected]
Hi all,
I have a question that is not connected directly to Moses. I am trying to
prepare the corpora for training my engine. I have exported a few of my TMs
to the TMX format and now I am trying to create two separate UTF-8 text
files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I
get empty text files for both (the former tool claims that the input file
can't be read). Are there any special settings I need to use when extracting
the TMX files? I am using SDL Trados Studio 2015 for exporting the files.
Has anyone come across anything like this?
-- Regards, Sašo