Hi Tom,
thanks a lot.
I'm sure it is a very useful resource.

Marco

On Tue, Aug 30, 2011 at 9:12 PM, Tom Hoar <
[email protected]> wrote:

> This link has tables with reportedly problematic characters and sequences.
>
>
> http://www.precisiontranslationtools.com/~precis30/index.php?option=com_content&view=article&id=94:are-there-characters-that-cause-problems-in-moses&catid=30:key-concepts&Itemid=57
>
> It's missing the 5 reserved XML characters:
>
> &      &amp;
> <      &lt;
> >      &gt;
> '       &apos;
> "      &quot;
>
> Tom
>
>
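The five reserved XML characters listed above can be escaped with Python's standard library. The following is only a minimal sketch: the function name `escape_for_decoder` is hypothetical, and mapping the vertical bar to the numeric reference `&#124;` is an assumption, since the thread does not say which replacement the pre-processing tool chain uses.

```python
from xml.sax.saxutils import escape

# escape() already handles &, < and >; the entries below cover the rest.
EXTRA_ENTITIES = {
    "'": "&apos;",
    '"': "&quot;",
    "|": "&#124;",  # assumption: replacement for the vertical bar is not given in the thread
}


def escape_for_decoder(line):
    """Escape the five reserved XML characters plus the vertical bar."""
    # The ampersand is replaced first, so entities added afterwards are not re-escaped.
    return escape(line, EXTRA_ENTITIES)
```

Calling `escape_for_decoder("AT&T | <b>")` would return the line with the ampersand, vertical bar and angle brackets replaced by their entities.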
>
> On Tue, 30 Aug 2011 14:11:47 +0200, marco turchi <[email protected]>
> wrote:
>
> Hi Tom,
> you are right, you should have the same problem when you decode the same
> data. In my case, the vertical bar is the main issue (for the moment).
> Would it be possible to share your list of characters?
>
> Thanks a lot and good luck!
> Marco
>
>
>
>
> On Tue, Aug 30, 2011 at 1:47 PM, Tom Hoar <
> [email protected]> wrote:
>
>> Thanks Marco,
>>
>> I considered that, but it doesn't explain why resuming on the same file is
>> successful without any changes. Time and data volume seem to be the culprits
>> here, and our work-around to auto-resume seems to be holding for now.
>>
>> I'm aware of the reserved characters. Our pre-processing tool chain
>> escapes 33 non-printing ASCII/ANSI control characters, plus the vertical
>> bar, plus 5 reserved XML characters. Nonetheless, we had two files last
>> night that caused a broken pipe (not a symptom of the time/volume problem
>> above). Resuming after these interruptions was not successful and we skipped
>> the files. These two files seem to contain a non-printing character, despite
>> our best efforts to escape everything. I suspect it's a non-printing Unicode
>> space character, such as en space (U+2002 ISOpub), em space (U+2003
>> ISOpub), thin space (U+2009 ISOpub), etc. Again, more to do next week.
>>
>> Tom
>>
>>
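One way to locate the suspected characters is to scan each file by Unicode general category. This is a sketch under stated assumptions: the function name `find_suspect_chars` and the choice of categories are illustrative and not part of any Moses tooling. Note that en space, em space and thin space fall in category Zs (space separator), so a filter that only looks for control characters (Cc) would miss them.

```python
import unicodedata

# Categories treated as suspicious here (an assumption, not a Moses rule):
# Cc = control, Cf = format, Zs = space separator, Zl/Zp = line/paragraph separator.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}
# Ordinary whitespace that should not be reported.
ALLOWED = {" ", "\n", "\t"}


def find_suspect_chars(text):
    """Return (offset, code point, Unicode name) for each suspicious character."""
    hits = []
    for offset, ch in enumerate(text):
        if ch in ALLOWED:
            continue
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            # Control characters have no name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((offset, "U+%04X" % ord(ch), name))
    return hits
```

Running this over the two skipped files line by line would show whether such a character is present and where.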
>>
>> On Mon, 29 Aug 2011 14:16:02 +0200, marco turchi <[email protected]>
>> wrote:
>>
>> Hi Tom,
>> I'm running something similar to your wrapper in Java on a 16-core
>> (thanks to hyperthreading) machine, and a common problem that I had at the
>> beginning was the presence of the "|" character in the source sentence.
>>
>> Cheers
>> Marco
>>
>> On Mon, Aug 29, 2011 at 1:58 PM, Barry Haddow <[email protected]> wrote:
>>
>>> Hi Tom
>>>
>>> If one of the moses caches was filling up, then I would expect that the
>>> process memory would increase, until the machine ground to a halt. The
>>> problem that Ivan had with the original version of his wrapper was slightly
>>> different: there was a fixed-size I/O buffer that he wasn't emptying, which
>>> eventually deadlocked his process.
>>>
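The deadlock described above (a fixed-size pipe buffer that is never emptied) can be avoided by draining the child's output in a separate thread while the main thread writes input. This is only a sketch: `pipe_lines` is a hypothetical helper, and the child process is a small Python stand-in, because the decoder command line used by the wrapper is not shown in this thread.

```python
import subprocess
import sys
import threading


def pipe_lines(command, lines):
    """Send lines to a child process while a thread drains its stdout."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = []

    def drain():
        # Reading continuously keeps the OS pipe buffer from filling up.
        for out_line in proc.stdout:
            results.append(out_line.rstrip("\n"))

    reader = threading.Thread(target=drain)
    reader.start()
    for line in lines:
        proc.stdin.write(line + "\n")
    proc.stdin.close()  # signal end of input so the child can exit
    reader.join()
    proc.wait()
    return results


# Stand-in child that upper-cases its input; a real setup would launch the
# decoder binary here instead (an assumption, not taken from the thread).
ECHO_UPPER = [
    sys.executable,
    "-c",
    "import sys\nfor l in sys.stdin: sys.stdout.write(l.upper())",
]
```

Without the reader thread, writing more input than the pipe buffers can hold would block both processes: the child waits for its stdout to be read, and the parent waits for the child to accept more stdin.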
>>> The lm cache that you mentioned below is, as far as I'm aware, specific to
>>> irstlm, so if you're not using irstlm, then the flag should have no effect.
>>>
>>> The translation option cache mainly helps by caching translation options for
>>> common phrases like 'the' or '.'. It is implemented as an LRU cache, and the
>>> decoder removes an entry when the cache reaches maximum size. I don't
>>> understand the quote from the manual about making sure this cache is
>>> frequently cleared - could you tell me where this quote comes from? Tuning
>>> the size of the translation options cache may help with performance, but it's
>>> unlikely to be the cause of the unexplained crashes.
>>>
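To illustrate why an LRU cache of this kind should not grow without bound, here is a minimal Python sketch of the general technique. It is not the Moses implementation (which is in C++); the class name `LRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self):
        return len(self._entries)
```

Because an entry is evicted whenever the size limit is exceeded, memory use stays bounded however many distinct phrases pass through, which is consistent with the cache being an unlikely cause of the crashes.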
>>> We frequently run multi-threaded decodes on multi-core machines and haven't
>>> witnessed any unexplained crashes. So I would quite like to eliminate the
>>> python/moses interaction as a cause of error. Is it possible to run a similar
>>> experiment without the python wrapper, say by just passing moses your source
>>> sentences in a file? If it is moses that is crashing, then if you could allow
>>> it to generate a core file and make it available, then I'd have some chance
>>> of debugging it.
>>>
>>> cheers - Barry
>>>
>>>
>>>
>>> On Monday 29 August 2011 12:23, Tom Hoar wrote:
>>> > I've implemented a multi-threaded Python wrapper that loads the moses
>>> > decoder and pipes strings through the moses binary. It's similar to Ivan
>>> > Uemlianin's code from May 04, 2010 on this listserv, but achieves a
>>> > throughput efficiency of 398% CPU load on a quad-core host across multiple
>>> > documents processed in a queue.
>>> >
>>> > Here's the rub. The decoder & the wrapper run great for about 2 hours.
>>> > Then they halt with an unknown error. It's difficult to trace because it
>>> > takes hours to reproduce. I can see that the Moses binary doesn't generate
>>> > an error exit code. There's no error message about a "broken" pipe. When I
>>> > restart the script on the file that was in-process at the time of the halt,
>>> > it runs just fine and continues processing. Since the error occurs
>>> > consistently at the 2 hour mark, and it's not the file causing the halt, I
>>> > suspect that a cache or buffer somewhere is overloaded. I've checked my
>>> > python code, and don't believe there are any buffer overruns there.
>>> >
>>> > I'm hoping someone can review my comments and give me some pointers about
>>> > Moses' caches and how to verify and manage the caches. The Moses manual
>>> > describes three caches:
>>> >
>>> >       * "-clean-lm-cache: clean language model caches after N
>>> > translations (default N=1)" : If -clean-lm-cache defaults to cleaning
>>> > the lm cache after each translation, I don't think this is a problem.
>>> >
>>> >       * "-persistent-cache-size: maximum size of cache for translation
>>> > options (default 10,000 input phrases)" : Some of my files have 2,500 or
>>> > more pages with 20-25 sentence lines each. This could exceed the default
>>> > 10,000 input phrase cache. Would it be better to bump up the
>>> > -persistent-cache-size value, or manage the number of phrases I pass to
>>> > the input?
>>> >
>>> >       * "-use-persistent-cache: cache translation options across
>>> > sentences (default true)" : Regarding caching across sentences (which
>>> > presumably applies to -use-persistent-cache), the manual says, "you should
>>> > also make sure that the cache is frequently cleared." How do I clear the
>>> > cache? Does this require forcing moses itself to unload, and then reload
>>> > it? Also, the -use-persistent-cache value defaults to "true". What is the
>>> > effect of changing this to "false"? Does it effectively disable this cache
>>> > and eliminate the requirement to clear it?
>>> >
>>> > Thanks,
>>> > Tom
>>>
>>>  --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>>
>>
>>
>
