Thanks, Tom, for your reply. I think I have found where the problem is. When I use the tokenizer.perl script to tokenize a string, it generates the output you mentioned: " keep your notification 's payload under 5 kb ." But if I use the tokenizer.perl script to process a file, the output is " keep your notification ' s payload under 5 kb ." which adds a space between ' and s, and this causes some translation problems. Can you please tell me why this happens?

Thanks
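One way to narrow this down, assuming the difference comes from the input text rather than from the options passed to tokenizer.perl, is to compare the code points of the typed string with the corresponding line from the file. The sketch below is in Python; the helper names and the typographic apostrophe (U+2019) in `from_file` are purely hypothetical illustrations, since nothing in this thread establishes what the file actually contains.

```python
import unicodedata


def code_points(text):
    """List each character as (U+XXXX, Unicode name) for inspection."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the common prefix: the strings differ only if lengths differ.
    return None if len(a) == len(b) else min(len(a), len(b))


# The sentence as typed on the command line (ASCII apostrophe).
typed = "keep your notification's payload under 5 kb."

# HYPOTHETICAL file content: the same sentence with U+2019 instead of U+0027.
# In practice this line would be read from the real file with encoding="utf-8".
from_file = "keep your notification\u2019s payload under 5 kb."

i = first_difference(typed, from_file)
if i is None:
    print("identical input; the difference must come from elsewhere")
else:
    print(i, code_points(typed[i]), code_points(from_file[i]))
```

If the two inputs turn out to be identical, the next thing to compare would be the exact command lines used in the two runs.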
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Tuesday, December 30, 2014 5:56 AM
To: [email protected]
Subject: Moses-support Digest, Vol 98, Issue 65

Send Moses-support mailing list submissions to
    [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
    http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
    [email protected]

You can reach the person managing the list at
    [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

   1. Moses tokenizer treats combining diaeresis inconsistently (Kenneth Heafield)
   2. Re: Moses tokenizer treats combining diaeresis inconsistently (John D Burger)
   3. Re: "'" in tokenization (Tom Hoar)

----------------------------------------------------------------------

Message: 1
Date: Mon, 29 Dec 2014 16:05:51 -0500
From: Kenneth Heafield <[email protected]>
Subject: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently
To: "[email protected]" <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

Dear Moses,

The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz ,
tokenizes differently on different machines.

I'm running tokenizer.perl from head (481a07dc) with this perl:

This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi (with 25 registered patches, see perl -V for
more detail)

perl -V is attached from newer machines.

The input is "Jürgen" with a specific encoding:

uconv -f utf-8 -x any-name jur
\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}

So the umlaut is encoded as a normal "u" character followed by a combining diaeresis marker. This encoding is legal, but it differs from the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH DIAERESIS}.

Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS} as a single character and recognizing it as part of the IsAlnum class. Tokenizing on these machines outputs

Jürgen

Newer machines are treating them separately, recognizing \N{COMBINING DIAERESIS} as a separate character that is not part of IsAlnum. The Moses tokenizer then treats it as something to split off, yielding this tokenization:

Ju ̈ rgen

I thought it might be locale-related, but IsAlnum is supposed to be locale-agnostic. I couldn't come up with environment variables that made the new machines tokenize as a single word.

Maybe this is a perl bug, but the result is that two different machines running the same perl script produce different tokenization :-(.

This is also a reason to turn Unicode normalization on. If the tokenizer did NFKC at the beginning, then the problem would go away.

Kenneth

-------------- next part --------------
A non-text attachment was scrubbed...
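
The effect of normalizing first can be illustrated outside the tokenizer. The following is a minimal sketch in Python using the standard `unicodedata` module; it is not the tokenizer's code, and Python's `str.isalnum` is only a stand-in for Perl's IsAlnum class, so it says nothing about why the two Perl installations disagree. The variable and function names are made up for the example.

```python
import unicodedata

# "Jürgen" with the umlaut written as 'u' + COMBINING DIAERESIS (U+0308),
# matching the uconv dump in the message above.
decomposed = "Ju\u0308rgen"


def describe(text):
    """Return (Unicode name, is-alphanumeric) for each code point."""
    return [(unicodedata.name(ch), ch.isalnum()) for ch in text]


# NFKC composes the pair into the single precomposed letter U+00FC.
composed = unicodedata.normalize("NFKC", decomposed)

print(len(decomposed), len(composed))
print(describe(decomposed)[2])  # the combining mark on its own
print(describe(composed)[1])    # the precomposed letter

# A base + combining pair that has no precomposed form in Unicode
# stays as two code points even after normalization.
no_precomposed = unicodedata.normalize("NFKC", "q\u0308")
print(len(no_precomposed))
```

In this sketch the combining mark on its own is not alphanumeric, while the composed letter is, which is the kind of split described above. The last lines show the limit of the approach: normalization only helps where a precomposed character exists.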
Name: jur.gz
Type: application/gzip
Size: 33 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9ce44a08/attachment-0001.bin
-------------- next part --------------
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:

  Platform:
    osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
    uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64 intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
    config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib64/perl5/5.18.2 -Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi -Dsitelib=/usr/local/lib64/perl5/5.18.2 -Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2 -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2 -Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0 5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dnoextensions=ODBM_File'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -march=native -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
    ccversion='', gccversion='4.7.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.18.2
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1 -Wl,--as-needed'

Characteristics of this binary (from libperl):
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS PERL_DONT_CREATE_GVSV PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
USE_REENTRANT_API Locally applied patches: gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054 cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH gentoo/EUMM_delete_packlist - Don't install .packlist or perllocal.pod for perl or vendor gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN. gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site directories by default. gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set libperl soname gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't force -fstack-protector on everyone. gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing @INC directories. gentoo/mod_paths - Add /etc/perl to @INC gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in patchlevel.h gentoo/aix_soname - aix gcc detection and shared library soname support gentoo/opensolars_headers - Add headers for opensolaris gentoo/cleanup-paths - Cleanup PATH and shrpenv gentoo/usr_local - Remove /usr/local paths gentoo/hints_hpux - Fix hpux hints gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC to link gentoo/interix - Fix interix hints fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP 'Port' option debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable fixes/memoize_storable_nstore - [rt.cpan.org #77790] Memoize::Storable: respect 'nstore' option not respected fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope gracefully with a failed command fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look up the list of local patches at run time fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576] untaint version, if needed, in Module::Metadata fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of IPC_CREAT in IPC-SysV documentation 
fixes/freemint - Built under linux Compiled at Oct 29 2014 20:59:02 @INC: /etc/perl /usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi /usr/local/lib64/perl5/5.18.2 /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.18.2 /usr/local/lib64/perl5 /usr/lib64/perl5/vendor_perl /usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi /usr/lib64/perl5/5.18.2 . ------------------------------ Message: 2 Date: Mon, 29 Dec 2014 16:40:42 -0500 From: John D Burger <[email protected]> Subject: Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently To: "[email protected]" <[email protected]> Message-ID: <[email protected]> Content-Type: text/plain; charset=utf-8 > This is also a reason to turn Unicode normalization on. If the > tokenizer did NFKC at the beginning, then the problem would go away. If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind. - JB On Dec 29, 2014, at 16:05 , Kenneth Heafield <[email protected]> wrote: > Dear Moses, > > The attached file, taken from line 2345157 of > http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013. > en.shuffled.gz , tokenizes differently on different machines. > > I'm running tokenizer.perl from head (481a07dc) with this perl: > > This is perl 5, version 18, subversion 2 (v5.18.2) built for > x86_64-linux-thread-multi (with 25 registered patches, see perl -V for > more detail) > > perl -V is attached from newer machines. 
>
> The input is "Jürgen" with a specific encoding:
>
> uconv -f utf-8 -x any-name jur
>
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN
> SMALL LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>
> So the umlaut is encoded as a normal "u" character followed by a
> combining diaeresis marker. This encoding is legal, but it differs
> from the single-character canonical encoding of \N{LATIN SMALL LETTER
> U WITH DIAERESIS}.
>
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS} as a single character and recognizing it as part of the
> IsAlnum class. Tokenizing on these machines outputs
>
> Jürgen
>
> Newer machines are treating them separately, recognizing \N{COMBINING
> DIAERESIS} as a separate character that is not part of IsAlnum. The
> Moses tokenizer then treats it as something to split off, yielding
> this tokenization:
>
> Ju ̈ rgen
>
> I thought it might be locale-related but IsAlnum is supposed to be
> locale-agnostic. I couldn't come up with environment variables that
> made the new machines tokenize as a single word.
>
> Maybe this is a perl bug, but the result is that two different
> machines running the same perl script produce different tokenization :-(.
>
> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.
>
> Kenneth
>
> <jur.gz><perl_V.txt>_______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 3
Date: Tue, 30 Dec 2014 10:54:18 +0700
From: Tom Hoar <[email protected]>
Subject: Re: [Moses-support] "'" in tokenization
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

The escaping is necessary because Moses reserves these characters for other uses.
When corpora are consistently prepared, the escaping has no effect on translation results. It looks like you have not prepared your corpora consistently. Note my results ('s) are different from yours (' s):

user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
keep your notification 's payload under 5 kb .

Go back and double-check how you prepare your training corpus and your translation jobs.

On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>
> Dears,
>
> When I make tokenization on files it replaces the apostrophes with
> "'" which make sense, but in the other side it crashes the
> meaning and the order of the words at all, for example:
>
> Sentence before tokenization :
>
> Src : keep your notification's payload under 5 kb.
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> Sentence after tokenization :
>
> Src: keep your notification ' s payload under 5 kb .
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> If I translate "keep" without using tokenization it will generates
> ?????? which Is correct but after using tokenization moses generates
> ????????? which means that the alignment is crashed
>
> do I make something wrong?
>
> do I miss something or just it is a natural behavior when I use
> tokenization
>
> Thanks
>
> Best Regards
>
> /Ihab Ramadan/| Senior Developer|Saudisoft <http://www.saudisoft.com/>
> - Egypt| *Tel * +2 02 330 320 37 Ext- 0| Mob+201007570826 |
> Fax+20233032036
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 98, Issue 65
*********************************************
