Hi Tom,
thanks a lot. I'm sure it is a very useful resource.

Marco
On Tue, Aug 30, 2011 at 9:12 PM, Tom Hoar <[email protected]> wrote:

> This link has tables with reportedly problematic characters and sequences.
>
> http://www.precisiontranslationtools.com/~precis30/index.php?option=com_content&view=article&id=94:are-there-characters-that-cause-problems-in-moses&catid=30:key-concepts&Itemid=57
>
> It's missing the 5 reserved XML characters:
>
> & &amp;
> < &lt;
> > &gt;
> ' &apos;
> " &quot;
>
> Tom
>
> On Tue, 30 Aug 2011 14:11:47 +0200, marco turchi <[email protected]>
> wrote:
>
> Hi Tom,
> you are right, you should have the same problem when you decode the same
> data. In my case, the vertical bar is the main issue (for the moment).
> Would it be possible to share your list of characters?
>
> Thanks a lot and good luck!
> Marco
>
> On Tue, Aug 30, 2011 at 1:47 PM, Tom Hoar <[email protected]> wrote:
>
>> Thanks Marco,
>>
>> I considered that, but it doesn't explain why resuming on the same file is
>> successful without any changes. Time and data volume seem to be the culprits
>> here, and our work-around to auto-resume seems to be holding for now.
>>
>> I'm aware of the reserved characters. Our pre-processing tool chain
>> escapes 33 non-printing ASCII/ANSI control characters, plus the vertical
>> bar, plus 5 reserved XML characters. Nonetheless, we had two files last
>> night that caused a broken pipe (not a symptom of the time/volume problem
>> above). Resuming after these interruptions was not successful and we skipped
>> the files. These two files seem to contain a non-printing character, despite
>> our best efforts to escape everything. I suspect it's a non-printing UTF-8
>> control character, such as en space (U+2002 ISOpub), em space (U+2003
>> ISOpub), thin space (U+2009 ISOpub), etc. Again, more to do next week.
>>
>> Tom
>>
>> On Mon, 29 Aug 2011 14:16:02 +0200, marco turchi <[email protected]>
>> wrote:
>>
>> Hi Tom,
>> I'm running something similar to your wrapper in Java on a 16-core
>> (thanks to hyperthreading) machine, and a common problem that I had at the
>> beginning was the presence of the "|" character in the source sentence.
>>
>> Cheers
>> Marco
>>
>> On Mon, Aug 29, 2011 at 1:58 PM, Barry Haddow <[email protected]> wrote:
>>
>>> Hi Tom
>>>
>>> If one of the moses caches was filling up, then I would expect that the
>>> process memory would increase, until the machine ground to a halt. The
>>> problem that Ivan had with the original version of his wrapper was
>>> slightly different: there was a fixed-size I/O buffer that he wasn't
>>> emptying, which eventually deadlocked his process.
>>>
>>> The lm cache that you mentioned below is, as far as I'm aware, specific
>>> to irstlm, so if you're not using irstlm, then the flag should have no
>>> effect.
>>>
>>> The translation option cache mainly helps by caching translation options
>>> for common phrases like 'the' or '.'. It is implemented as an LRU cache,
>>> and the decoder removes an entry when the cache reaches maximum size. I
>>> don't understand the quote from the manual about making sure this cache
>>> is frequently cleared - could you tell me where this quote comes from?
>>> Tuning the size of the translation options cache may help with
>>> performance, but it's unlikely to be the cause of the unexplained
>>> crashes.
>>>
>>> We frequently run multi-threaded decodes on multi-core machines and
>>> haven't witnessed any unexplained crashes. So I would quite like to
>>> eliminate the python/moses interaction as a cause of error. Is it
>>> possible to run a similar experiment without the python wrapper, say by
>>> just passing moses your source sentences in a file?
>>> If it is moses that is crashing, then if you could allow it to generate
>>> a core file and make it available, then I'd have some chance of
>>> debugging it,
>>>
>>> cheers - Barry
>>>
>>> On Monday 29 August 2011 12:23, Tom Hoar wrote:
>>> > I've implemented a multi-threaded Python wrapper that loads the moses
>>> > decoder and pipes strings through the moses binary. It's similar to
>>> > Ivan Uemlianin's code from May 04, 2010 on this listserv, but achieves
>>> > a throughput efficiency of 398% CPU load on a quad-core host across
>>> > multiple documents processed in a queue.
>>> >
>>> > Here's the rub. The decoder & the wrapper run great for about 2 hours.
>>> > Then they halt with an unknown error. It's difficult to trace because
>>> > it takes hours to reproduce. I can see that the Moses binary doesn't
>>> > generate an error exit code. There's no error message about a "broken"
>>> > pipe. When I restart the script on the file that was in-process at the
>>> > time of the halt, it runs just fine and continues processing. Since the
>>> > error occurs consistently at the 2 hour mark, and it's not the file
>>> > causing the halt, I suspect a cache or buffer somewhere is overloaded.
>>> > I've checked my python code, and don't believe there are any buffer
>>> > overruns there.
>>> >
>>> > I'm hoping someone can review my comments and give me some pointers
>>> > about Moses' caches and how to verify and manage the caches. The Moses
>>> > manual describes three caches:
>>> >
>>> > * "-clean-lm-cache: clean language model caches after N
>>> > translations (default N=1)" : If -clean-lm-cache defaults to cleaning
>>> > the lm cache after each translation, I don't think this is a problem.
>>> >
>>> > * "-persistent-cache-size: maximum size of cache for translation
>>> > options (default 10,000 input phrases)" : Some of my files have 2,500
>>> > or more pages with 20-25 sentence lines each.
>>> > This could exceed the default 10,000 input phrase cache. Would it be
>>> > better to bump up the -persistent-cache-size value, or manage the
>>> > number of phrases I pass to the input?
>>> >
>>> > * "-use-persistent-cache: cache translation options across sentences
>>> > (default true)" : Regarding caching across sentences (which presumably
>>> > applies to -use-persistent-cache), the manual says, "you should also
>>> > make sure that the cache is frequently cleared." How do I clear the
>>> > cache? Does this require forcing moses itself to unload, and then
>>> > reload it? Also, the -use-persistent-cache value defaults to "true".
>>> > What is the effect of changing this to "false"? Does it effectively
>>> > disable this cache and eliminate the requirement to clear it?
>>> >
>>> > Thanks,
>>> > Tom
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
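
A minimal sketch in Python of the kind of pre-processing discussed in the thread above: escaping the 33 ASCII control characters (0x00-0x1F plus 0x7F), the vertical bar, and the 5 reserved XML characters, and then flagging non-ASCII space or control characters such as U+2002, U+2003 and U+2009 that would slip past that escaping. The thread does not describe the actual tool chain, so the replacement strings, the choice to turn control characters into plain spaces, and the function names escape_for_decoder and find_suspect_characters are all assumptions made for illustration.

```python
import unicodedata

# Assumed replacements: the thread names the characters, not the escapes.
XML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
}
PIPE_ESCAPE = "&#124;"  # assumption: numeric character reference for "|"

# Unicode categories treated as suspect: space/line/paragraph separators,
# control characters and invisible format characters.
SUSPECT_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def escape_for_decoder(line):
    """Escape one input line (hypothetical helper; pass it without the newline)."""
    out = []
    for ch in line:
        code = ord(ch)
        if code < 0x20 or code == 0x7F:
            out.append(" ")  # assumption: control characters become a space
        elif ch == "|":
            out.append(PIPE_ESCAPE)
        elif ch in XML_ESCAPES:
            out.append(XML_ESCAPES[ch])
        else:
            out.append(ch)
    return "".join(out)


def find_suspect_characters(line):
    """List (index, code point, category) for non-ASCII invisible characters."""
    suspects = []
    for index, ch in enumerate(line):
        if ord(ch) < 0x80:
            continue  # the ASCII range is handled by escape_for_decoder
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES:
            suspects.append((index, "U+%04X" % ord(ch), category))
    return suspects
```

Running find_suspect_characters over each line of a file that breaks the pipe is one way to check the suspicion about en, em and thin spaces before deciding how to escape them.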
