Hello Lane,
frankly, I don't see this as all that desirable. You just exchange a magic
character for an even more magic one. Since the proposed character is
not an ASCII character, you'll eventually run into encoding problems. And
most people would find it very difficult to type this character on the
keyboard and to distinguish it from the regular | symbol. It just gets
more and more obscure.
To really improve on the ugly "magic file format" issue, I'd love to see
support for XML-based input and configuration files. There are tons of
tools out there to handle XML files, and there are no limitations with
respect to the content (even multi-line input would be possible). You
can easily check conformance (using a DTD) and you can keep the files
backwards compatible if you so desire. Of course, I understand very well
that this is a major effort that's not easy to address.
Just my two cents,
Christof
PS: And yes, I spent substantial effort on making my tool chain
pipe-proof. I'd hate to sift through all that again for no practical gain.
On 11/15/10 12:55 PM, Lane Schwartz wrote:
I'd like to propose changing the current factor delimiter to something
other than the single vertical bar |.
Looking through the mailing archives, it seems that the failure to
properly purge your corpus of vertical bars is a frequent source of
headaches for users. I know I've encountered this problem before, but
even knowing that I should do this, just today I had to track down
another vertical bar-related problem.
I don't really care what the replacement character(s) ends up being,
just so that any corpus munging related to this delimiter gets handled
internally by moses rather than being the user's responsibility.
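To illustrate the kind of munging meant here, below is a minimal Python sketch of escaping vertical bars before factors are joined and restoring them afterwards. It is not existing moses code: the function names and the choice of `&#124;` as the escape sequence are assumptions made purely for illustration.

```python
FACTOR_DELIMITER = "|"
ESCAPED_BAR = "&#124;"   # hypothetical escape sequence, chosen for illustration
ESCAPED_AMP = "&amp;"    # "&" is escaped too, so the mapping stays reversible


def escape_token(token: str) -> str:
    """Make a raw corpus token safe to use as a single factor."""
    # Escape "&" first so a literal "&#124;" in the corpus survives a round trip.
    return token.replace("&", ESCAPED_AMP).replace(FACTOR_DELIMITER, ESCAPED_BAR)


def unescape_token(token: str) -> str:
    """Undo escape_token."""
    return token.replace(ESCAPED_BAR, FACTOR_DELIMITER).replace(ESCAPED_AMP, "&")


def join_factors(factors: list[str]) -> str:
    """Build a factored word such as surface|POS from raw factor strings."""
    return FACTOR_DELIMITER.join(escape_token(f) for f in factors)


def split_factors(word: str) -> list[str]:
    """Split a factored word and recover the raw factor strings."""
    return [unescape_token(f) for f in word.split(FACTOR_DELIMITER)]


if __name__ == "__main__":
    word = join_factors(["a|b", "NN"])
    print(word)
    print(split_factors(word))
```

If something along these lines happened internally, a stray vertical bar in the corpus would no longer be mistaken for a factor boundary, whichever delimiter is ultimately used.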
If moses could easily be modified to take a multi-character delimiter,
that would probably be best. My suggestion for a single-character
delimiter would be something with the following characteristics:
* Character should be printable (i.e., not a control character)
* Character should be one that's implemented in most commonly used fonts
* Character should be highly obscure, and extremely unlikely to appear
in a corpus
* Character should not be confusable with any commonly used character.
Many characters in the Dingbats section of Unicode (the block starting at
U+2700) would fit these desiderata.
I suggest Unicode character U+2759, MEDIUM VERTICAL BAR. This is a
highly obscure printable character that looks like a thick vertical
bar. It's obviously a vertical bar, but just as obviously not the same
thing as the regular vertical bar |.
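For anyone who wants to check a candidate character against the criteria above, here is a small Python sketch using only the standard library. The helper name `describe` is made up for this example. The check also makes the encoding cost visible: the character is not ASCII, so it takes several bytes in UTF-8.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect basic properties of a single candidate delimiter character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # categories starting with "C" are control/format
        "printable": ch.isprintable(),
        "ascii": ch.isascii(),
        "utf8_bytes": ch.encode("utf-8"),
    }


if __name__ == "__main__":
    # Compare the proposed delimiter with the regular vertical bar.
    for candidate in ("\u2759", "|"):
        for key, value in describe(candidate).items():
            print(f"{key:>10}: {value}")
        print()
```

This cannot say whether a character is present in a given font or absent from a given corpus; those two criteria still have to be checked against the actual fonts and data.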
Cheers,
Lane
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support