My impression has always been that the pipe character is a fairly common 
choice across the field for delimiters like this, possibly to the point 
of being a standard, which makes it seem unwise to arbitrarily choose 
some other character. Perhaps it would be more useful to channel the 
effort into generic tools to help pipe-proof input and pipelines for 
arbitrary applications (not that I have any useful suggestions for how 
to do so).

I also like the suggestion of having to explicitly turn on the factor 
delimiter, and the command-line arg to use a character other than the 
default pipe. I don't know about the XML-based input, so I can't comment 
on that.
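
To make the opt-in idea concrete, here is a minimal sketch in Python. It is not Moses code: the names split_factors and parse_line and the delimiter parameter are hypothetical, made up purely for illustration. With no delimiter given, each token is treated as a single factor, so a stray | in the corpus is harmless; a delimiter (possibly multi-character) only takes effect when it is passed explicitly.

```python
from typing import List, Optional


def split_factors(token: str, delimiter: Optional[str] = None) -> List[str]:
    """Split one token into its factors.

    Hypothetical helper: with no delimiter (the proposed default),
    the whole token is returned as a single factor.
    """
    if not delimiter:
        return [token]
    return token.split(delimiter)


def parse_line(line: str, delimiter: Optional[str] = None) -> List[List[str]]:
    """Split a line on whitespace, then split each token into factors."""
    return [split_factors(tok, delimiter) for tok in line.split()]


if __name__ == "__main__":
    # Default: non-factored input, the pipe is just another character.
    print(parse_line("a|b c"))
    # Factors explicitly turned on with a chosen delimiter.
    print(parse_line("house|NN is|VB", delimiter="|"))
```

Something like the --factorDelimiter flag Miles mentions would then just feed the delimiter argument, and leaving the flag off would mean plain input.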

S.

On 16/11/10 8:54 AM, Christian Hardmeier wrote:
> I fully agree with Miles.
>
> In my opinion, replacing the pipe with an exotic Unicode character is
> bad because
> - in a web-crawled corpus, any Unicode character might occur, however
>    exotic it is. If it's exotic, it will be even harder to track down
>    the problem when it occurs.
> - it assumes that everybody is using UTF-8, which I don't think is true.
>    I know people working with Latin-1 encoded corpora, and for all I
>    know, somebody out there may be using an encoding in which the bytes
>    encoding "exotic UTF-8 character of your choice" in fact encode a
>    very common letter or sign. Using a character from the ASCII subset
>    reduces dependence on particular encodings as far as possible.
>
> I like Miles's suggestion of not having a factor delimiter at all unless
> explicitly turned on. If that's too complicated, I think we should stick
> to the current situation, so at least we know the problems and how to
> fix them, and, as Christof pointed out, some people may already have
> tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
> hate to change it).
>
> /Christian
>
> On Mon, 15 Nov 2010, Miles Osborne wrote:
>
>> i second this.
>>
>> but can I make another suggestion.  make the default be *non* factored
>> input.  i reckon that most people using Moses don't actually use
>> factors (hands-up if you do).
>> this means, plain input, with absolutely no meta chars in them.
>>
>> and if you are going to use meta-chars, why not just have a flag such as:
>>
>> --factorDelimiter=|
>>
>> etc.
>>
>> Miles
>>
>> On 15 November 2010 21:30, Hieu Hoang<[email protected]>  wrote:
>>> That's a good idea. In the decoder, there are 4 places that have to be
>>> changed because the delimiter is hardcoded:
>>>    ConfusionNet
>>>    GenerationDictionary
>>>    LanguageModelJoint
>>>    Word::createFromString
>>>
>>> However, train-model.perl is more difficult to change.
>>>
>>> Hieu
>>> Sent from my flying horse
>>>
>>> On 15 Nov 2010, at 09:00 PM, Lane Schwartz<[email protected]>  wrote:
>>>
>>>> I'd like to propose changing the current factor delimiter to something 
>>>> other than the single vertical bar |
>>>>
>>>> Looking through the mailing archives, it seems that the failure to 
>>>> properly purge your corpus of vertical bars is a frequent source of 
>>>> headaches for users. I know I've encountered this problem before, but even 
>>>> knowing that I should do this, just today I had to track down another 
>>>> vertical bar-related problem.
>>>>
>>>> I don't really care what the replacement character(s) ends up being, just 
>>>> so that any corpus munging related to this delimiter gets handled 
>>>> internally by moses rather than being the user's responsibility.
>>>>
>>>> If moses could easily be modified to take a multi-character delimiter, 
>>>> that would probably be best. My suggestion for a single-character 
>>>> delimiter would be something with the following characteristics:
>>>>
>>>> * Character should be printable (i.e. not a control character)
>>>> * Character should be one that's implemented in most commonly used fonts
>>>> * Character should be highly obscure, and extremely unlikely to appear in 
>>>> a corpus
>>>> * Character should not be confusable with any commonly used character.
>>>>
>>>> Many characters in the Dingbats block of Unicode (starting at U+2700) 
>>>> would fit these desiderata.
>>>>
>>>> I suggest Unicode character U+2759, MEDIUM VERTICAL BAR. This is a highly 
>>>> obscure printable character that looks like a thick vertical bar. It's 
>>>> obviously a vertical bar, but just as obviously not the same thing as the 
>>>> regular vertical bar |.
>>>>
>>>> Cheers,
>>>> Lane
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>

-- 
Suzy Howlett
http://www.showlett.id.au/