Hi,

I suggest moving the token-joining step to after the tokenization step.
(You need to know where the token boundaries are before you can remove them,
and once you remove token boundaries, you don't want to add new ones.)
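
A minimal sketch of that ordering in Python, in case it helps. It assumes the input has already been through the tokenizer (so tokens are separated by single spaces), and that "pairs or triplets" means consecutive, non-overlapping groups of n tokens. That grouping rule, the function names, and the output file name in the usage note are my own assumptions for illustration, not anything Moses provides. The '~' joiner is the one used earlier in this thread; any character that never occurs in your data would do.

```python
def join_tokens(line, n=2, joiner="~"):
    """Join every n consecutive tokens of an already-tokenized line.

    Assumption: groups are non-overlapping; a shorter final group is kept as-is.
    """
    tokens = line.split()
    groups = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    return " ".join(joiner.join(group) for group in groups)


def join_file(src_path, dst_path, n=2, joiner="~"):
    """Apply join_tokens to each line of a tokenized file (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(join_tokens(line, n, joiner) + "\n")


print(join_tokens("the man saw the dog ."))       # the~man saw~the dog~.
print(join_tokens("the man saw the dog .", n=3))  # the~man~saw the~dog~.
```

You would run something like join_file("training.tok.en", "training.joined.en") on the tokenizer output (the second file name is hypothetical), apply the same step to the decoder input, and not tokenize again afterwards.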

Ben

On Tue, Jun 14, 2011 at 2:42 PM, <[email protected]> wrote:

>
> ---------- Forwarded message ----------
> From: Miles Osborne <[email protected]>
> To: Anna c <[email protected]>
> Date: Mon, 13 Jun 2011 18:50:07 +0100
> Subject: Re: [Moses-support] How to change phrase representation
> the simplest approach would be to use another character to join words
> together.  the tokeniser thinks you have hyphenated words, which is
> probably what you don't want.
>
> Miles
>
> On 13 June 2011 18:39, Anna c <[email protected]> wrote:
> > Hi,
> > I've tried what you suggested, but I'm not sure if I'm doing it right...
> > I've replaced all the occurrences in the input files as you said, adding
> > a '~' between the words (as in "the~man"), but when I look at the file
> > training.tok.en or training.tok.es (resulting from the first steps in the
> > guide), the words have been separated and appear as "the ~ man". Should
> > I change tokenizer.perl to ignore the '~', or should I skip that step?
> > Or is it correct that way?
> >
> > Thank you very much!
> > Best regards,
> > Anna
> >
> >
> >
> >
> >> Date: Fri, 10 Jun 2011 10:48:07 +0100
> >> Subject: Re: [Moses-support] How to change phrase representation
> >> From: [email protected]
> >> To: [email protected]
> >> CC: [email protected]
> >>
> >> Hi,
> >>
> >> I am not entirely sure if I fully understand your question,
> >> but let me try to answer.
> >>
> >> The phrase-based model implementation treats each token
> >> separated by white space as a word. It also learns
> >> translation entries for sequences of words ("phrases").
> >>
> >> If you want to group words into larger tokens, then you
> >> have to replace the white spaces.
> >>
> >> For instance, if you want to force the training setup and decoder
> >> to treat "the man" as a unit, then you should replace all
> >> occurrences (in training data and decoder input) with "the~man".
> >>
> >> -phi
> >>
> >> On Fri, Jun 10, 2011 at 10:38 AM, Anna c <[email protected]> wrote:
> >> > Hi!
> >> > I'm doing a master's degree and I need some help with one of my
> >> > subjects.
> >> > I've already installed GIZA++ and Moses correctly, and followed the
> >> > step-by-step guide on the website, checking that everything was OK.
> >> > But I'm a newbie at this and I'm a bit lost. What I have to do is
> >> > change the representation so that the basic unit is not the word but
> >> > pairs or triplets of words, and compare it with the normal
> >> > representation. How do I do that? Do I have to change the
> >> > preparation step in the training?
> >> >
> >> > Thank you very much!
> >> > Best regards,
> >> > Anna
> >> >
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
