Hi,
I had some problems with TMX extraction scripts and wrote my own. You might 
find it useful:
 
https://github.com/havet/TMX2Moses
 
It simply disregards the specification in the header and reads the
source and target language from the <tu> elements.
 
Works on single TMX-files as well as on folders containing TMX-files.
 
Yours,
Per Tunedal
 
On Sun, Mar 13, 2016, at 12:03, Tom Hoar wrote:
> I don't know the tmx2txt.pl script, but I can suggest where to look
> for problems.
>
> The most frequent problem we have when extracting data from TMX
> files comes from files that don't comply with the TMX specification,
> especially regarding compliance with the srclang attributes. The
> spec states this about how to identify the source language:
>
>> "*the <tuv> holding the source segment will have
>> its xml:lang attribute set to the same value as srclang. (except
>> if srclang is set to "*all*"). If a <tu> element does not have a
>> srclang attribute specified, it uses the one defined in the
>> <header> element.*"
>
> Sadly, many TMX creation tools, including tools from SDL, do not
> properly identify the source language. Each tool that looks for the
> source language TUV according to the spec handles erroneous TMX
> segments in its own way. So, you need to learn how your TMX declares
> the srclang attribute, and then study the script to see where
> there's a mismatch.
>
> You can see how we managed these sloppy TMX files in this post, only
> a week old:
> https://pttools.freshdesk.com/discussions/topics/6000034251
>
> Hope this helps.
>
> Tom
>
>
>
> On 3/12/2016 8:57 PM, [email protected] wrote:
>> Date: Sat, 12 Mar 2016 13:42:05 +0100
From: Sašo Kuntaric <[email protected]>
Subject: [Moses-support] Preparing TMX files for use in Moses
To: [email protected]

Hi all,

I have a question that is not directly connected to Moses. I am trying
to prepare corpora for training my engine. I have exported a few of
my TMs to the TMX format and now I am trying to create two separate
UTF-8 text files. I have tried the extract-tmx-corpus and tmx2txt.pl
tools. Both give me empty text files (the former tool claims that the
input file can't be read). Are there any special settings I need to use
when extracting from the TMX files? I am using SDL Trados Studio 2015 for
exporting the files.

Has anyone come across anything like this?

>>
>>
>> --
Best regards,

Sašo
>>
>
> _________________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
 
