My impression has always been that the pipe character is a fairly common choice across the field for delimiters like this, possibly to the point of being a standard, which makes it seem unwise to arbitrarily choose some other character. Perhaps it would be more useful to channel the effort into generic tools to help pipe-proof input and pipelines for arbitrary applications (not that I have any useful suggestions for how to do so).
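A rough sketch of what such a generic pipe-proofing step might look like, in Python and purely as an illustration. Nothing here is part of Moses: the function names and the choice of `&#124;` as the stand-in for a literal pipe are assumptions of the sketch. Ampersands are escaped first so that the mapping can be undone exactly.

```python
# Hypothetical pipe-proofing helpers; names and placeholder are assumptions.
PIPE_PLACEHOLDER = "&#124;"  # assumed stand-in for a literal vertical bar
AMP_PLACEHOLDER = "&amp;"    # protects ampersands already in the text


def escape_pipes(line: str) -> str:
    """Return the line with every literal '|' replaced by a placeholder."""
    # Escape '&' first so text that already contains '&#124;' stays
    # distinguishable from pipes escaped by this function.
    return line.replace("&", AMP_PLACEHOLDER).replace("|", PIPE_PLACEHOLDER)


def unescape_pipes(line: str) -> str:
    """Undo escape_pipes, restoring the original text."""
    # Reverse order of escape_pipes: pipes first, then ampersands.
    return line.replace(PIPE_PLACEHOLDER, "|").replace(AMP_PLACEHOLDER, "&")
```

Running `escape_pipes` over each corpus line before training, and `unescape_pipes` over the output, would keep stray pipes away from the factor parser without the user having to purge them by hand.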
I also like the suggestion of having to explicitly turn on the factor delimiter, and the command-line arg to use a character other than the default pipe. I don't know about the xml-based input, so can't comment on that.

S.

On 16/11/10 8:54 AM, Christian Hardmeier wrote:
> I fully agree with Miles.
>
> In my opinion, replacing the pipe with an exotic Unicode character is
> bad because
> - in a web-crawled corpus, any Unicode character might occur, however
>   exotic it is. If it's exotic, it will be even harder to track down
>   the problem when it occurs.
> - it assumes that everybody is using UTF-8, which I don't think is true.
>   I know people working with Latin-1 encoded corpora, and for all I
>   know, somebody out there may be using an encoding in which the bytes
>   encoding "exotic UTF-8 character of your choice" in fact encode a
>   very common letter or sign. Using a character from the ASCII subset
>   reduces dependence on particular encodings as far as possible.
>
> I like Miles's suggestion of not having a factor delimiter at all unless
> explicitly turned on. If that's too complicated, I think we should stick
> to the current situation, so at least we know the problems and how to
> fix them, and, as Christof pointed out, some people may already have
> tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
> hate to change it).
>
> /Christian
>
> On Mon, 15 Nov 2010, Miles Osborne wrote:
>
>> i second this.
>>
>> but can I make another suggestion. make the default be *non* factored
>> input. i reckon that most people using Moses don't actually use
>> factors (hands-up if you do).
>> this means, plain input, with absolutely no meta chars in them.
>>
>> and if you are going to use meta-chars, why not just have a flag such as:
>>
>> --factorDelimiter=|
>>
>> etc.
>>
>> Miles
>>
>> On 15 November 2010 21:30, Hieu Hoang<[email protected]> wrote:
>>> That's a good idea. In the decoder, there's 4 places that has to be
>>> changed cos it's hardcoded
>>> ConfusionNet
>>> GenerationDictionary
>>> LanguageModelJoint
>>> Word::createFromString
>>>
>>> However, the train-model.perl is more difficult to change
>>>
>>> Hieu
>>> Sent from my flying horse
>>>
>>> On 15 Nov 2010, at 09:00 PM, Lane Schwartz<[email protected]> wrote:
>>>
>>>> I'd like to propose changing the current factor delimiter to something
>>>> other than the single vertical bar |
>>>>
>>>> Looking through the mailing archives, it seems that the failure to
>>>> properly purge your corpus of vertical bars is a frequent source of
>>>> headaches for users. I know I've encountered this problem before, but even
>>>> knowing that I should do this, just today I had to track down another
>>>> vertical bar-related problem.
>>>>
>>>> I don't really care what the replacement character(s) ends up being, just
>>>> so that any corpus munging related to this delimiter gets handled
>>>> internally by moses rather than being the user's responsibility.
>>>>
>>>> If moses could easily be modified to take a multi-character delimeter,
>>>> that would probably be best. My suggestion for a single-character
>>>> delimiter would be something with the following characteristics:
>>>>
>>>> * Character should be printable (ie not a control character)
>>>> * Character should be one that's implemented in most commonly used fonts
>>>> * Character should be highly obscure, and extremely unlikely to appear in
>>>>   a corpus
>>>> * Character should not be confusable with any commonly used character.
>>>>
>>>> Many characters in the Dingbats section of Unicode (block 2700) would fit
>>>> these desiderata.
>>>>
>>>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>>>> obscure printable character that looks like a thick vertical bar. It's
>>>> obviously a vertical bar, but just as obviously not the same thing as the
>>>> regular vertical bar |.
>>>>
>>>> Cheers,
>>>> Lane
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.

--
Suzy Howlett
http://www.showlett.id.au/

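The opt-in factor delimiter discussed in this thread could be sketched roughly as below, in Python rather than in the decoder's own code. Nothing here is Moses code: the flag spelling is borrowed from Miles's example, and the names `split_factors`, `parse_line` and `build_parser` are hypothetical. With no flag given, tokens are left untouched, so a stray | in the corpus is just another character.

```python
import argparse
from typing import List, Optional


def split_factors(token: str, delimiter: Optional[str] = None) -> List[str]:
    """Split one token into factors; with no delimiter the token is one factor."""
    if not delimiter:
        # Non-factored default: no character is treated as special.
        return [token]
    # str.split accepts multi-character separators, so the delimiter
    # is not limited to a single character.
    return token.split(delimiter)


def parse_line(line: str, delimiter: Optional[str] = None) -> List[List[str]]:
    """Turn a whitespace-tokenised line into a list of factor lists."""
    return [split_factors(token, delimiter) for token in line.split()]


def build_parser() -> argparse.ArgumentParser:
    """Command-line parser with a hypothetical opt-in delimiter flag."""
    parser = argparse.ArgumentParser(description="Factored input sketch")
    parser.add_argument(
        "--factorDelimiter",
        dest="factor_delimiter",
        default=None,  # factors stay off unless the user asks for them
        help="string separating factors within a token (default: none)",
    )
    return parser
```

For example, `build_parser().parse_args(["--factorDelimiter=|"])` sets `factor_delimiter` to `"|"`, and `parse_line("the|DT house|NN", "|")` then yields one list of factors per token, while `parse_line("the|DT house|NN")` leaves both tokens whole.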