Thank you all for the input. Here's my summary of comments and concerns:

* People have invested considerable effort into toolchains and regression tests, and any changes should be sensitive to that fact
* Most people don't use factors
* There is an existing flag that allows users to specify the factor delimiter
* The status quo causes problems because users have to ensure their corpora are cleansed of certain meta-characters (vertical bar, especially)

My view is that it would be great if Moses and its support scripts could eventually support arbitrary plain text input, not treating anything as a meta-character. But I know that this involves more effort than I'm willing to invest right now.

I'm hoping that Miles's suggestion is something that can be agreed on. That is, make the default Moses behavior be *non* factored input. This should help prevent vertical-bar-related problems for most users, but (as far as I can see) should not disrupt existing toolchains, regression tests, or users who use factors.

If you voted for the status quo, please let me know whether the above would satisfy your concerns.

Thanks,
Lane

On Mon, Nov 15, 2010 at 3:55 PM, Lane Schwartz <[email protected]> wrote:

> I'd like to propose changing the current factor delimiter to something
> other than the single vertical bar |
>
> Looking through the mailing archives, it seems that the failure to properly
> purge your corpus of vertical bars is a frequent source of headaches for
> users. I know I've encountered this problem before, but even knowing that I
> should do this, just today I had to track down another vertical-bar-related
> problem.
>
> I don't really care what the replacement character(s) ends up being, just
> so that any corpus munging related to this delimiter gets handled internally
> by Moses rather than being the user's responsibility.
>
> If Moses could easily be modified to take a multi-character delimiter, that
> would probably be best. My suggestion for a single-character delimiter would
> be something with the following characteristics:
>
> * Character should be printable (i.e., not a control character)
> * Character should be one that's implemented in most commonly used fonts
> * Character should be highly obscure, and extremely unlikely to appear in a
>   corpus
> * Character should not be confusable with any commonly used character.
>
> Many characters in the Dingbats section of Unicode (block 2700) would fit
> these desiderata.
>
> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
> highly obscure printable character that looks like a thick vertical bar.
> It's obviously a vertical bar, but just as obviously not the same thing as
> the regular vertical bar |.
>
> Cheers,
> Lane
>

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
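
For anyone who hasn't hit the vertical bar problem yet, here is a rough Python sketch of what goes wrong. It is only an illustration, not Moses code: split_factors and find_delimiter_collisions are hypothetical helpers I made up for this message, and the real tokenization in Moses may well differ. The idea is simply that a literal | in plain text gets read as a factor boundary, whereas a rare delimiter such as U+2759 leaves it alone.

```python
# Illustration only: these helpers are hypothetical and are not part of Moses.
MEDIUM_VERTICAL_BAR = "\u2759"  # the proposed alternative delimiter


def split_factors(token, delimiter="|"):
    """Split one whitespace-delimited token into its factors."""
    return token.split(delimiter)


def find_delimiter_collisions(lines, delimiter="|"):
    """Return (line_number, token) pairs for tokens containing the delimiter.

    Meant for unfactored text, where any occurrence of the delimiter is a
    literal character that would be misread as a factor boundary.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for token in line.split():
            if delimiter in token:
                hits.append((number, token))
    return hits


if __name__ == "__main__":
    # A factored token splits as intended.
    print(split_factors("house|NN"))
    # A literal bar in plain text is misread as two empty factors.
    print(split_factors("|"))
    # With the rarer delimiter, the literal bar survives as a single factor.
    print(split_factors("|", MEDIUM_VERTICAL_BAR))
    # Scanning a corpus for tokens that would collide with the delimiter.
    print(find_delimiter_collisions(["see table | below", "no bars here"]))
```

A scan like find_delimiter_collisions is roughly the check users currently have to remember to run by hand before training, which is the burden I'd like to see go away.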
