I second this. But can I make another suggestion: make the default be *non*-factored input. I reckon that most people using Moses don't actually use factors (hands up if you do). This means plain input, with absolutely no meta-chars in it.
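To make that concrete, here is a rough sketch in Python. It is not Moses code: `split_factors`, `parse_sentence` and the `delimiter` parameter are made-up names for illustration only. The assumption is simply that a token is taken as-is unless a delimiter has been given explicitly.

```python
from typing import List, Optional


def split_factors(token: str, delimiter: Optional[str] = None) -> List[str]:
    """Split one input token into its factors.

    Hypothetical helper for illustration only, not part of Moses.
    With delimiter=None (the proposed default) the token is plain,
    non-factored input and no character is treated specially.
    """
    if delimiter is None:
        return [token]
    if delimiter == "":
        raise ValueError("factor delimiter must not be empty")
    # str.split accepts multi-character separators as well.
    return token.split(delimiter)


def parse_sentence(line: str, delimiter: Optional[str] = None) -> List[List[str]]:
    """Tokenize on whitespace, then split each token into factors."""
    return [split_factors(tok, delimiter) for tok in line.split()]


if __name__ == "__main__":
    # Plain input: the vertical bar is just an ordinary character.
    print(parse_sentence("a|b c"))
    # Factored input, with the delimiter chosen explicitly by the user.
    print(parse_sentence("house|NN the|DT", delimiter="|"))
```

With this shape, a stray vertical bar in an unfactored corpus never needs purging, because nothing looks for it unless asked to.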
And if you are going to use meta-chars, why not just have a flag such as: --factorDelimiter=| etc.?

Miles

On 15 November 2010 21:30, Hieu Hoang <[email protected]> wrote:
> That's a good idea. In the decoder, there are 4 places that have to be
> changed because it's hardcoded:
>   ConfusionNet
>   GenerationDictionary
>   LanguageModelJoint
>   Word::createFromString
>
> However, train-model.perl is more difficult to change.
>
> Hieu
> Sent from my flying horse
>
> On 15 Nov 2010, at 09:00 PM, Lane Schwartz <[email protected]> wrote:
>
>> I'd like to propose changing the current factor delimiter to something other
>> than the single vertical bar |
>>
>> Looking through the mailing list archives, it seems that the failure to properly
>> purge your corpus of vertical bars is a frequent source of headaches for
>> users. I know I've encountered this problem before, but even knowing that I
>> should do this, just today I had to track down another vertical bar-related
>> problem.
>>
>> I don't really care what the replacement character(s) ends up being, just so
>> that any corpus munging related to this delimiter gets handled internally by
>> Moses rather than being the user's responsibility.
>>
>> If Moses could easily be modified to take a multi-character delimiter, that
>> would probably be best. My suggestion for a single-character delimiter would
>> be something with the following characteristics:
>>
>> * Character should be printable (i.e. not a control character)
>> * Character should be one that's implemented in most commonly used fonts
>> * Character should be highly obscure, and extremely unlikely to appear in a
>>   corpus
>> * Character should not be confusable with any commonly used character.
>>
>> Many characters in the Dingbats section of Unicode (block 2700) would fit
>> these desiderata.
>>
>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>> obscure printable character that looks like a thick vertical bar. It's
>> obviously a vertical bar, but just as obviously not the same thing as the
>> regular vertical bar |.
>>
>> Cheers,
>> Lane
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
