I'd like to propose changing the current factor delimiter from the single
vertical bar (|) to something else.

Looking through the mailing list archives, it seems that failing to purge
vertical bars from a corpus is a frequent source of headaches for users. I've
run into this problem before, and even though I know I should do this, just
today I had to track down yet another vertical-bar-related problem.

I don't really care what the replacement character(s) end up being, as long
as any corpus munging related to this delimiter gets handled internally by
Moses rather than being the user's responsibility.
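
To make "handled internally" concrete, here is a minimal sketch of escaping
the delimiter when factors are joined and undoing it when they are split. The
function names and the entity-style escape sequences are my own assumptions
for illustration, not anything Moses currently does.

```python
# Hypothetical helpers, not part of Moses: escape the delimiter inside
# factor values so that users never have to purge it from the corpus.
DELIM = "|"

def escape_factor(value: str) -> str:
    # Escape "&" first so that a literal "&#124;" in the corpus survives
    # the round trip unchanged.
    return value.replace("&", "&amp;").replace(DELIM, "&#124;")

def unescape_factor(value: str) -> str:
    # Reverse the two replacements in the opposite order.
    return value.replace("&#124;", DELIM).replace("&amp;", "&")

def join_factors(factors):
    # Build one factored word, e.g. surface|POS.
    return DELIM.join(escape_factor(f) for f in factors)

def split_factors(word):
    # Recover the original factor values from a factored word.
    return [unescape_factor(f) for f in word.split(DELIM)]
```

With something like this inside the toolkit, a token such as "a|b" could stay
in the corpus as-is, and only the internal representation would change.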

If Moses could easily be modified to take a multi-character delimiter, that
would probably be best. My suggestion for a single-character delimiter would
be something with the following characteristics:

* Character should be printable (i.e., not a control character).
* Character should be one that's implemented in most commonly used fonts.
* Character should be highly obscure, and extremely unlikely to appear in a
corpus.
* Character should not be confusable with any commonly used character.
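
Two of these criteria can be checked mechanically. The sketch below uses only
the Python standard library; the function names are made up for illustration,
and font coverage and confusability still need a human eye.

```python
def is_printable_candidate(ch: str) -> bool:
    # A single, visible, non-whitespace character; str.isprintable()
    # rejects control characters.
    return len(ch) == 1 and ch.isprintable() and not ch.isspace()

def count_in_corpus(lines, ch: str) -> int:
    # Number of times the candidate already occurs in the corpus;
    # ideally this is zero.
    return sum(line.count(ch) for line in lines)
```

For example, running count_in_corpus over the lines of a training corpus
would show how often a candidate character already occurs in the data.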

Many characters in the Dingbats section of Unicode (the block starting at
U+2700) would fit these desiderata.

I suggest Unicode character U+2759, MEDIUM VERTICAL BAR. This is a
highly obscure printable character that looks like a thick vertical bar.
It's obviously a vertical bar, but just as obviously not the same thing as
the regular vertical bar |.
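
As a quick illustration of the difference between the two characters, here is
a small Python sketch. The split_factored_word helper is hypothetical; it only
shows that a literal | in the corpus would no longer clash with the delimiter.

```python
import unicodedata

ASCII_BAR = "|"        # U+007C, the current delimiter
MEDIUM_BAR = "\u2759"  # U+2759, the proposed delimiter

# The two characters are distinct code points with distinct Unicode names.
names = (unicodedata.name(ASCII_BAR), unicodedata.name(MEDIUM_BAR))

def split_factored_word(word: str, delim: str = MEDIUM_BAR):
    # Hypothetical helper: split one factored word into its factors.
    return word.split(delim)

# A literal "|" token tagged as punctuation stays intact.
factors = split_factored_word("|" + MEDIUM_BAR + "PUNC")
```

One cost to keep in mind is that U+2759 takes three bytes in UTF-8, where the
ASCII bar takes one.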

Cheers,
Lane
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
