In the scripts directory, | is hardcoded in several places. If someone wants
to replace it, egrep -R "['\"\\]\\|" * can help spot them (it looks for a |
right after a quote or a \ ; it also returns many unrelated matches, and it
probably misses some places).
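
For anyone without egrep at hand, the same search can be sketched in Python.
This is only a rough equivalent of the command above, not a tool that ships
with Moses; the name find_hardcoded_bars is made up for illustration, and it
has the same limits (unrelated matches, missed places).

```python
import re
from pathlib import Path

# Assumed translation of the egrep pattern: a single quote, double quote,
# or backslash immediately followed by a vertical bar.
HARDCODED_BAR = re.compile(r"""['"\\]\|""")


def find_hardcoded_bars(root):
    """Yield (path, line_number, line) for every matching line under root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps the scan going on non-UTF-8 files.
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it, as a sketch should
        for number, line in enumerate(text.splitlines(), start=1):
            if HARDCODED_BAR.search(line):
                yield str(path), number, line.rstrip()
```

Calling list(find_hardcoded_bars("scripts")) returns the candidate lines; each
one still has to be checked by hand.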

I think being able to choose the factor delimiter would be nice (right now
the option exists but is not completely implemented). Ideally, being able to
choose the token delimiter (space) and the field delimiter (|||) would be
good too.
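
To make the three delimiters concrete, here is a minimal Python sketch of a
phrase-table line parser where each of them is a parameter. It is not how
Moses or train-model.perl does it; Delimiters and parse_line are hypothetical
names, and the defaults are simply the values named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiters:
    field: str = "|||"  # separates the fields of a phrase-table line
    token: str = " "    # separates the tokens inside a field
    factor: str = "|"   # separates the factors inside a token


def parse_line(line, delims=Delimiters()):
    """Split a line into fields, each field into tokens, each token into factors."""
    fields = []
    # Fields are split first, so a factor delimiter that is a substring of
    # the field delimiter (| inside |||) does not break the field split.
    for field in line.rstrip("\n").split(delims.field):
        tokens = [tok for tok in field.strip().split(delims.token) if tok]
        fields.append([tok.split(delims.factor) for tok in tokens])
    return fields
```

With the defaults, parse_line("das|ART haus|NN ||| the|DT house|NN") gives two
fields of two tokens with two factors each. Passing
Delimiters(factor="\u2759") parses the same table written with a different
factor delimiter, and a literal | in a token is then left alone.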

I wouldn't like to move to XML, because I enjoy having simple, easily
readable phrase tables.

-- 
Raphael Payen


-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Hieu Hoang
Sent: 15 November 2010 21:31
To: Lane Schwartz
Cc: [email protected]
Subject: Re: [Moses-support] Proposal to replace vertical bar as factor
delimiter

That's a good idea. In the decoder, there are 4 places that have to be
changed because the delimiter is hardcoded:
   ConfusionNet
   GenerationDictionary
   LanguageModelJoint
   Word::createFromString

However, train-model.perl is more difficult to change.

Hieu
Sent from my flying horse

On 15 Nov 2010, at 09:00 PM, Lane Schwartz <[email protected]> wrote:

> I'd like to propose changing the current factor delimiter to something
> other than the single vertical bar |
>
> Looking through the mailing archives, it seems that the failure to
> properly purge your corpus of vertical bars is a frequent source of
> headaches for users. I know I've encountered this problem before, but even
> knowing that I should do this, just today I had to track down another
> vertical bar-related problem.
>
> I don't really care what the replacement character(s) end up being, as
> long as any corpus munging related to this delimiter gets handled
> internally by moses rather than being the user's responsibility.
>
> If moses could easily be modified to take a multi-character delimiter,
> that would probably be best. My suggestion for a single-character
> delimiter would be something with the following characteristics:
>
> * Character should be printable (i.e. not a control character)
> * Character should be one that's implemented in most commonly used fonts
> * Character should be highly obscure, and extremely unlikely to appear in
>   a corpus
> * Character should not be confusable with any commonly used character
>
> Many characters in the Dingbats section of Unicode (block 2700) would fit
> these desiderata.
>
> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
> obscure printable character that looks like a thick vertical bar. It's
> obviously a vertical bar, but just as obviously not the same thing as the
> regular vertical bar |.
>
> Cheers,
> Lane
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
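
The corpus check that the quoted proposal wants taken off the user's hands
can be sketched in a few lines of Python. This is an illustration only, not
existing Moses code; CANDIDATE, describe and lines_containing are names made
up here. It looks up the suggested character in the Unicode database and
reports which corpus lines already contain a given delimiter.

```python
import unicodedata

CANDIDATE = "\u2759"  # the character suggested in the quoted message


def describe(ch):
    """Return the code point, Unicode name, and general category of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)


def lines_containing(lines, delimiter):
    """Return the 1-based numbers of the lines that contain the delimiter."""
    return [n for n, line in enumerate(lines, start=1) if delimiter in line]
```

Running lines_containing over a corpus once with "|" and once with CANDIDATE
shows how often each would collide with real text, which is the property the
proposal asks for in a replacement.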

