Hi Per,

I would like to ask whether this runs on Linux.
I work on Ubuntu and I am trying to convert TMX to Moses files to train my
system.

Thanks.
Ricardo

2016-03-14 9:05 GMT+01:00 Per Tunedal <[email protected]>:

> Hi,
> I had some problems with TMX extraction scripts and wrote my own. You
> might find it useful:
>
> https://github.com/havet/TMX2Moses
>
> It simply disregards the specification in the header and reads the source
> and target language from the <tu> elements.
>
> Works on single TMX-files as well as on folders containing TMX-files.
>
> Yours,
> Per Tunedal
>
> On Sun, Mar 13, 2016, at 12:03, Tom Hoar wrote:
>
> I don't know the tmx2txt.pl script, but I can suggest where to look for
> problems.
>
> The most frequent problem we have when extracting data from TMX files
> comes from files that don't comply with the TMX specification, especially
> regarding compliance with the srclang attributes. The spec states this
> about how to identify the source language:
>
> "the <tuv> holding the source segment will have its xml:lang attribute
> set to the same value as srclang. (except if srclang is set to "*all*"). If
> a <tu> element does not have a srclang attribute specified, it uses the one
> defined in the <header> element."
>
> Sadly, many TMX creation tools, including tools from SDL, do not properly
> identify the source language. Each tool that looks for the source language
> TUV according to the spec handles erroneous TMX segments in its own way.
> So, you need to learn how your TMX declares the srclang attribute, and then
> study the script to see where there's a mismatch.
>
> You can see how we managed these sloppy TMX files in this post, only a
> week old:
> https://pttools.freshdesk.com/discussions/topics/6000034251
>
> Hope this helps.
>
> Tom
>
> On 3/12/2016 8:57 PM, [email protected] wrote:
>
> Date: Sat, 12 Mar 2016 13:42:05 +0100
> From: Sašo Kuntaric <[email protected]>
> Subject: [Moses-support] Preparing TMX files for use in Moses
> To: [email protected]
>
> Hi all,
>
> I have a question that is not connected directly to Moses. I am trying to
> prepare the corpora for training my engine. I have exported a few of my TMs
> to the TMX format and now I am trying to create two separate UTF-8 text
> files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I
> get empty text files for both (the former tool claims that the input file
> can't be read). Are there any special settings I need to set when extracting
> the TMX files? I am using SDL Trados Studio 2015 for exporting the files.
>
> Has anyone come across anything like this?
>
> --
> lp,
>
> Sašo
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
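
The idea described in the quoted messages (take the languages from the
xml:lang attribute of each <tuv> instead of trusting srclang in the header)
can be sketched in a few lines of Python. This is only an illustration under
stated assumptions, not the code of TMX2Moses or tmx2txt.pl: the function
names (extract_pairs, write_moses_files) are made up for this example, language
codes are compared by their primary subtag only (so "en-US" matches "en"), and
inline formatting elements inside <seg> are simply flattened to their text.
It uses only the Python standard library, so it should run on Linux as well.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def segment_text(tuv):
    """Return the text of the <seg> child with whitespace collapsed.

    Simplification: text inside inline elements is kept as-is.
    """
    seg = tuv.find("seg")
    if seg is None:
        return ""
    return " ".join("".join(seg.itertext()).split())


def extract_pairs(tmx_source, src_lang, tgt_lang):
    """Collect (source, target) pairs from a TMX file or file object.

    The header's srclang is ignored on purpose; only the language declared
    on each <tuv> is used. Assumption: codes are matched on the primary
    subtag, lower-cased.
    """
    pairs = []
    tree = ET.parse(tmx_source)
    for tu in tree.getroot().iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute if xml:lang is absent.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            primary = lang.lower().replace("_", "-").split("-")[0]
            by_lang.setdefault(primary, segment_text(tuv))
        src = by_lang.get(src_lang)
        tgt = by_lang.get(tgt_lang)
        # Skip units where either side is missing or empty, so the two
        # output files stay aligned line by line.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def write_moses_files(pairs, src_path, tgt_path):
    """Write one segment per line into two parallel UTF-8 text files."""
    with open(src_path, "w", encoding="utf-8") as fs, \
            open(tgt_path, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
```

A call such as write_moses_files(extract_pairs("export.tmx", "en", "sl"),
"corpus.en", "corpus.sl") would then produce the two text files; the file
names and language codes here are placeholders.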
