Hello Tom,
1. no stemming is applied, only splitting - we used it on the target
side for our English->German system, and no information is lost.
2. the truecasing model will make each segment upper-/lowercased
depending on which is more frequent in the training data. with
'-no-truecase', the original case is kept.
3. the exact string depends on whether the word has a "Fugenelement"
like "-n", "-e", "-es", "-s", and "-". Here's an example of how
"Geburtstag" (birthday) is split (if -max-count is high enough):
default: Geburt tag
-write-filler: Geburt @s@ tag
-merge-filler: Geburts@@ tag
if there is no Fugenelement, then yes, @@ is inserted with -write-filler:
Geburttag -> Geburt @@ tag
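To make the three markup conventions concrete, here is a toy illustration of the formatting step (my own sketch, not code from the splitter script; `format_split` and its arguments are hypothetical names):

```python
def format_split(parts, fillers, mode="default"):
    """Render a compound split in the three markup styles from the example.

    parts:   list of compound parts, e.g. ["Geburt", "tag"]
    fillers: filler element (Fugenelement) after each non-final part,
             "" if there is none
    """
    out = [parts[0]]
    for filler, part in zip(fillers, parts[1:]):
        if mode == "default":
            out.append(part)                   # drop the filler entirely
        elif mode == "write-filler":
            out.append("@%s@" % filler)        # keep filler as its own token
            out.append(part)
        elif mode == "merge-filler":
            out[-1] = out[-1] + filler + "@@"  # glue filler to the left part
            out.append(part)
    return " ".join(out)

# format_split(["Geburt", "tag"], ["s"], "default")      -> "Geburt tag"
# format_split(["Geburt", "tag"], ["s"], "write-filler") -> "Geburt @s@ tag"
# format_split(["Geburt", "tag"], ["s"], "merge-filler") -> "Geburts@@ tag"
# format_split(["Geburt", "tag"], [""],  "write-filler") -> "Geburt @@ tag"
```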
4. the input should be tokenized, but not lowercased. If you want to
apply lowercasing, you can do this after splitting.
For re-joining the splits for the final system, we simply used a regex
on the filler elements:
sed -r 's/ \@(\S*?)\@ /\1/g' | sed -r 's/\@\@ //g'
Note that we never tested this on a phrase-based system, and there might
be more spurious reorderings in a phrase-based system than in our
string-to-tree system in which we used this.
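The sed pipeline above can be sketched as a Python equivalent (a minimal sketch assuming the filler markup shown in the examples; not part of the splitter script itself):

```python
import re

def rejoin(text):
    # " @s@ " -> "s": merge a separately written filler token (-write-filler)
    # back into the surrounding compound parts
    text = re.sub(r' @(\S*?)@ ', r'\1', text)
    # "Geburts@@ tag" -> "Geburtstag": remove the -merge-filler join marker
    text = re.sub(r'@@ ', '', text)
    return text

# rejoin("Geburt @s@ tag")  -> "Geburtstag"
# rejoin("Geburt @@ tag")   -> "Geburttag"
# rejoin("Geburts@@ tag")   -> "Geburtstag"
```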
best wishes,
Rico
On 01/02/17 01:36, Tom Hoar wrote:
I'm sharing some feedback and asking a new question.
I tried the SoMaJo German tokenizer. After considerable work with some
customers, we concluded it does not work as well for SMT as the
built-in Moses tokenizer.perl with German. So, back to the drawing board.
Rico, I'm revisiting your hybrid splitter and have some questions.
1. Are stemmed tokens in the output, or only the original tokens simply
split? It seems that for SMT support, no stemming is applied. I just
want to verify because I cannot use stemmed output.
2. I need the split output to be natural cased, i.e. not lower-cased.
Is this the purpose of the `-no-truecase` argument?
3. Can you confirm that the `-write-filler` argument marks the split
using " @@ "?
4. The command to train a model is simple enough:
`hybrid_compound_splitter.py -train -syntax -corpus INPUT_FILE
-model MODEL_FILE`
What state is the German INPUT_FILE in? i.e. tokenized or not?
lower-cased or not?
In a separate but similar line, what is the current state of the art
in using compound-split corpus in the target language and then
re-joining the splits with proper casing for a final rendering?
Thanks!
Tom
On 8/26/2016 9:15 AM, [email protected] wrote:
Date: Thu, 25 Aug 2016 09:05:13 -0700
From: Tom Hoar <[email protected]>
Subject: Re: [Moses-support] German compound splitter
To: "[email protected]" <[email protected]>
Thank you, Rico! Looks promising.
I found this one on Python's PyPI repository: https://pypi.python.org/pypi/SoMaJo/1.1.2
Does anyone have any experience with it?
Tom
On 8/25/2016 11:01 PM, [email protected] wrote:
Date: Wed, 24 Aug 2016 17:23:22 +0100
From: Rico Sennrich <[email protected]>
Subject: Re: [Moses-support] German compound splitter
To: [email protected]
Hi Tom,
I've been using this one for the Edinburgh WMT submission (EN-DE
syntax-based) in the last 3 years:
https://github.com/rsennrich/wmt2014-scripts/blob/master/hybrid_compound_splitter.py
It implements the hybrid (frequency-based and FST-based) algorithm by
Fritzinger & Fraser 2010: "How to Avoid Burning Ducks: Combining
Linguistic Analysis and Corpus Statistics for German Compound Processing"
best wishes,
Rico
On 24 August 2016 at 09:14, Tom Hoar <[email protected]> wrote:
Does anyone recommend a German compound splitter? I know it's been
discussed here before. Thanks.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support