Re: [Moses-support] German compound splitter

Tom Hoar Wed, 01 Feb 2017 05:31:09 -0800

Thanks, Rico. Very helpful.

I already started walking through the code and gutting what I don't need. When I'm done, it'll be a pure unsupervised splitter without syntax or SMOR support, probably similar to the existing compound-splitter.perl script, but in Python.

I'm refactoring the code so a Splitter() class does all the work. Users can import the class into other Python scripts. A call to `Splitter.split_compounds(line)` splits a single line. As a command-line executable, it'll continue to function like the existing script with all the same arguments (less the ones supporting SMOR and syntax). A `-train` argument creates the model. The `-corpus` argument reads either a path to a UTF-8 file or piped STDIN. The script iterates over lines and prints UTF-8 to STDOUT. There's one difference. This version makes/reads only raw JSON model files, not JSON saved as a Python module to import. That extra code seemed unnecessary since json is a standard library.
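For readers following along, here is a rough sketch of what such a class could look like. Only the names `Splitter` and `split_compounds` come from the description above. Everything else (the `train`, `save` and `load` methods, the filler list, the minimum part length, and the geometric-mean frequency score) is an assumption for illustration, and it only tries a single binary split per word. It is not the refactored script, nor necessarily what either existing script does.

```python
import json
import math
from collections import Counter


class Splitter:
    """Illustrative frequency-based compound splitter (not the actual script)."""

    # Assumed linking elements; "" means the parts join directly.
    FILLERS = ("", "s", "es", "n", "e")

    def __init__(self, counts=None, min_len=3):
        self.counts = Counter(counts or {})
        self.min_len = min_len  # shortest part we are willing to produce

    def train(self, lines):
        """Count whitespace-separated tokens from an iterable of lines."""
        for line in lines:
            self.counts.update(line.split())

    def save(self, path):
        """Write the model as raw JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.counts, f, ensure_ascii=False)

    @classmethod
    def load(cls, path):
        """Read a raw JSON model written by save()."""
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def _count(self, part):
        # German nouns are capitalized, so try both casings of a part.
        return max(self.counts[part], self.counts[part.capitalize()])

    def _split_word(self, word):
        # The unsplit word is the baseline any split has to beat.
        best_score, best_parts = self.counts[word], [word]
        for i in range(self.min_len, len(word) - self.min_len + 1):
            head, tail = word[:i], word[i:]
            for filler in self.FILLERS:
                if not head.endswith(filler):
                    continue
                stem = head[:len(head) - len(filler)]
                if len(stem) < self.min_len:
                    continue
                # Geometric mean of the two part frequencies.
                score = math.sqrt(self._count(stem) * self._count(tail))
                if score > best_score:
                    best_score, best_parts = score, [stem, tail]
        return best_parts

    def split_compounds(self, line):
        """Split every token of one tokenized line; return the new line."""
        return " ".join(p for w in line.split() for p in self._split_word(w))
```

With a toy model such as `Splitter({"Geburt": 50, "Tag": 100, "Geburtstag": 2})`, calling `split_compounds("Der Geburtstag")` drops the linking "s" and leaves the case of each part as it appeared in the input.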

I'll return the update to you. If the Moses team wants, I'm happy to contribute it. I have a few questions about your comments.

RE "with '-no-truecase', the original case is kept," it looks like "original case" means the original text being split, not the original text in the model?


RE "-write-filler," I'll have to play with it to see how it works.

RE "we never tested this on a phrase-based system," I'm betting that on a phrase-based system, any splitting is better than none. Our customers use their own translation memories (80K to 150K pairs) to create SMT models for their private work. In non-DE (RU, FR, ES and others) language use cases, those SMT models create 40% (or more) suggestions that are exactly correct in double-blind tests. That's more stringent than edit-distance zero because the translators aren't influenced by post-editing. The same source text goes to a human and the engine. When comparing the totally independent results, the number of "exactly the same" translated segments is very high. Note that these results are a testament to (a) the Moses team for making such a great tool, and (b) the translators for having superb TMs. Of course, we take a little credit for making Moses accessible to the customers with Slate Desktop on Windows and Linux. Sadly, our DE customers experience less than 25% correct. So like I said, any DE splitting is bound to improve the results. I'll let you guys know.
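To make the "exactly correct" figure concrete, here is a minimal sketch of how such a rate could be computed from two aligned lists of segments. The function name `exact_match_rate` and the whitespace stripping are assumptions for illustration, not a description of the actual test procedure.

```python
def exact_match_rate(human, machine):
    """Fraction of segments where the two independent translations are identical."""
    if len(human) != len(machine):
        raise ValueError("segment lists must be aligned one-to-one")
    if not human:
        return 0.0
    # Assumption: ignore leading/trailing whitespace, compare everything else.
    matches = sum(h.strip() == m.strip() for h, m in zip(human, machine))
    return matches / len(human)
```

Two matching segments out of five give a rate of 0.4, i.e. the 40% level mentioned above.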


Tom



On 2/1/2017 6:26 PM, [email protected] wrote:

Date: Wed, 1 Feb 2017 11:19:51 +0000
From: Rico Sennrich<[email protected]>
Subject: Re: [Moses-support] German compound splitter
To: [email protected]

Hello Tom,

  1. no stemming is applied, only splitting - we used it on the target
     side for our English->German system, and no information is lost.

  2. the truecasing model will make each segment upper-/lowercased
     depending on which is more frequent in the training data. with
     '-no-truecase', the original case is kept.

  3. the exact string depends on whether the word has a "Fugenelement"
     like "-n", "-e", "-es", "-s", and "-". Here's an example of how
     "Geburtstag" (birthday) is split (if -max-count is high enough):

     default: Geburt tag
     -write-filler: Geburt @s@ tag
     -merge-filler: Geburts@@ tag

     if there is no Fugenelement, then yes, @@ is inserted with
     -write-filler: Geburttag -> Geburt @@ tag

  4. the input should be tokenized, but not lowercased. If you want to
     apply lowercasing, you can do this after splitting.

For re-joining the splits for the final system, we simply used a regex
on the filler elements:

     sed -r 's/ \@(\S*?)\@ /\1/g' | sed -r 's/\@\@ //g'

Note that we never tested this on a phrase-based system, and there
might be more spurious reorderings in a phrase-based system than in our
string-to-tree system in which we used this.

best wishes,
Rico

On 01/02/17 01:36, Tom Hoar wrote:
I'm sharing some feedback and asking a new question.

I tried the SoMaJo German tokenizer. After considerable work with some
customers, we concluded it does not work as well for SMT as the
built-in Moses tokenizer.perl with German. So, back to the drawing board.

Rico, I'm revisiting your hybrid splitter and have some questions.

  1. Are stemmed tokens in the output or only original tokens simply
     split? It seems for SMT support, no stemming is applied. I just
     want to verify because I cannot use stemmed output.

  2. I need the split output to be natural cased, i.e. not lower-cased.
     Is this the purpose of the `-no-truecase` argument?

  3. Can you confirm that the `-write-filler` argument marks the split
     using " @@ "?

  4. The command to train a model is simple enough:

     `hybrid_compound_splitter.py -train -syntax -corpus INPUT_FILE
     -model MODEL_FILE`

     What state is German INPUT_FILE ? i.e. tokenized or not?
     lower-cased or not?

In a separate but similar line, what is the current state of the art
in using compound-split corpus in the target language and then
re-joining the splits with proper casing for a final rendering?


Thanks!
Tom
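The two-step sed pipeline from Rico's reply can be sketched in Python roughly as follows. The function name `rejoin_compounds` is an assumption for illustration. The sketch only undoes the filler markers and does not attempt any recasing.

```python
import re


def rejoin_compounds(line):
    """Undo -write-filler / -merge-filler markers on one line of output."""
    # " @s@ " -> "s"; the bare marker " @@ " collapses to nothing.
    line = re.sub(r" @(\S*?)@ ", r"\1", line)
    # "Geburts@@ tag" -> "Geburtstag" (merged-filler form).
    return re.sub(r"@@ ", "", line)
```

Applied to the "Geburtstag" examples from the reply, all three marked forms ("Geburt @s@ tag", "Geburts@@ tag" and "Geburt @@ tag") are joined back into a single token.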


On 8/26/2016 9:15 AM,[email protected]  wrote:
Date: Thu, 25 Aug 2016 09:05:13 -0700
From: Tom Hoar<[email protected]>
Subject: Re: [Moses-support] German compound splitter
To:"[email protected]"   <[email protected]>

Thank you, Rico! Looks promising.
I found this one on Python's PyPI repository: https://pypi.python.org/pypi/SoMaJo/1.1.2
Does anyone have any experience with it?

Tom



On 8/25/2016 11:01 PM,[email protected]   wrote:
Date: Wed, 24 Aug 2016 17:23:22 +0100
From: Rico Sennrich<[email protected]>
Subject: Re: [Moses-support] German compound splitter
To:[email protected]

Hi Tom,

I've been using this one for the Edinburgh WMT submission (EN-DE
syntax-based) in the last 3 years:
https://github.com/rsennrich/wmt2014-scripts/blob/master/hybrid_compound_splitter.py
It implements the hybrid (frequency-based and FST-based) algorithm by
Fritzinger & Fraser 2010: "How to Avoid Burning Ducks: Combining
Linguistic Analysis and Corpus Statistics for German Compound Processing"

best wishes,
Rico

On 24 August 2016 at 09:14, Tom Hoar<[email protected]>   wrote:
Does anyone recommend a German compound splitter? I know it's been
discussed here before. Thanks.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support


--
Best regards,

Tom Hoar
Chief Executive Officer
*/Precision Translation Tools Pte Ltd/*
Singapore/Thailand

Web: www.precisiontranslationtools.com<http://www.precisiontranslationtools.com>

Thailand Mobile: +66 87 345-1875
Skype call: tahoar <skype:tahoar?call>
Skype chat: tahoar <skype:tahoar>
