Three ideas come to mind:

1) Make sure you are passing the `-l en` argument properly. Each language isolates punctuation differently.
2) Are you sure your file uses UTF-8 character encoding?
3) The version of Perl you're using may be sensitive to other things, such as locale settings. Try setting the environment variable LC_ALL=C in your terminal.
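For point 2, here is a minimal sketch of a UTF-8 check, assuming Python 3 is available on your machine. The function name `first_invalid_utf8_line` is made up for illustration and is not part of Moses. It only tells you whether the bytes decode as UTF-8; it says nothing about how tokenizer.perl will treat them.

```python
def first_invalid_utf8_line(path):
    """Return the 1-based number of the first line that is not valid UTF-8,
    or None if every line in the file decodes cleanly."""
    # Open in binary mode so that no decoding happens implicitly.
    with open(path, "rb") as handle:
        # Iterating a binary file splits on b"\n"; a valid multi-byte
        # UTF-8 sequence never contains that byte, so this is safe.
        for number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")  # strict mode raises on bad bytes
            except UnicodeDecodeError:
                return number
    return None
```

Call it on the file you pass to tokenizer.perl. A result of None means the whole file decoded as UTF-8; a number points at the first line worth inspecting.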
Happy New Year!

On 12/31/2014 07:05 PM, Ihab Ramadan wrote:
> Thanks Tom for your reply,
> I think I found where the problem is. When I use the tokenizer.perl script
> to tokenize a string, it generates the output you mentioned:
> " keep your notification 's payload under 5 kb ."
> But if I use the tokenizer.perl script to process a file, the output is
> " keep your notification ' s payload under 5 kb ."
> which adds a space between ' and s, and this causes some translation problems.
> Can you please tell me why this happens?
> Thanks
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of [email protected]
> Sent: Tuesday, December 30, 2014 5:56 AM
> To: [email protected]
> Subject: Moses-support Digest, Vol 98, Issue 65
>
> Send Moses-support mailing list submissions to
> [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mailman.mit.edu/mailman/listinfo/moses-support
> or, via email, send a message with subject or body 'help' to
> [email protected]
>
> You can reach the person managing the list at
> [email protected]
>
> When replying, please edit your Subject line so it is more specific than
> "Re: Contents of Moses-support digest..."
>
>
> Today's Topics:
>
> 1. Moses tokenizer treats combining diaeresis inconsistently
>    (Kenneth Heafield)
> 2. Re: Moses tokenizer treats combining diaeresis inconsistently
>    (John D Burger)
> 3. Re: "'" in tokenization (Tom Hoar)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 29 Dec 2014 16:05:51 -0500
> From: Kenneth Heafield <[email protected]>
> Subject: [Moses-support] Moses tokenizer treats combining diaeresis
>    inconsistently
> To: "[email protected]" <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Dear Moses,
>
> The attached file, taken from line 2345157 of
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> , tokenizes differently on different machines.
>
> I'm running tokenizer.perl from head (481a07dc) with this perl:
>
> This is perl 5, version 18, subversion 2 (v5.18.2) built for
> x86_64-linux-thread-multi (with 25 registered patches, see perl -V for more
> detail)
>
> perl -V is attached from newer machines.
>
> The input is "Jürgen" with a specific encoding:
>
> uconv -f utf-8 -x any-name jur
>
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}
> \N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL LETTER E}
> \N{LATIN SMALL LETTER N}\N{<control-000A>}
>
> So the umlaut is encoded as a normal "u" character followed by a combining
> diaeresis marker. This encoding is legal, but it differs from the
> single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
> DIAERESIS}.
>
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}
> as a single character and recognizing it as part of the IsAlnum class.
> Tokenizing on these machines outputs
>
> Jürgen
>
> Newer machines are treating them separately, recognizing \N{COMBINING
> DIAERESIS} as a separate character that is not part of IsAlnum. The Moses
> tokenizer then treats it as something to split off, yielding this
> tokenization:
>
> Ju ̈ rgen
>
> I thought it might be locale-related but IsAlnum is supposed to be
> locale-agnostic.
> I couldn't come up with environment variables that made
> the new machines tokenize as a single word.
>
> Maybe this is a perl bug, but the result is that two different machines
> running the same perl script produce different tokenization :-(.
>
> This is also a reason to turn Unicode normalization on. If the tokenizer
> did NFKC at the beginning, then the problem would go away.
>
> Kenneth
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: jur.gz
> Type: application/gzip
> Size: 33 bytes
> Desc: not available
> Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9ce44a08/attachment-0001.bin
> -------------- next part --------------
> Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
>
> Platform:
>   osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
>   uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64
>     intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
>   config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread
>     -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe
>     -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
>     -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
>     -Dprivlib=/usr/lib64/perl5/5.18.2
>     -Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
>     -Dsitelib=/usr/local/lib64/perl5/5.18.2
>     -Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
>     -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2
>     -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
>     -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
>     -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
>     -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
>     -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2
>     -Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles
>     -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
>     -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm
>     -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none
>     -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0
>     5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64
>     /usr/lib64 -Dnoextensions=ODBM_File'
>   hint=recommended, useposix=true, d_sigaction=define
>   useithreads=define, usemultiplicity=define
>   useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
>   use64bitint=define, use64bitall=define, uselongdouble=undef
>   usemymalloc=n, bincompat5005=undef
> Compiler:
>   cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
>     -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
>   optimize='-O3 -march=native -pipe',
>   cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
>   ccversion='', gccversion='4.7.3', gccosandvers=''
>   intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
>   d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
>   ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
>     lseeksize=8
>   alignbytes=8, prototype=define
> Linker and Libraries:
>   ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
>   libpth=/usr/local/lib64 /lib64 /usr/lib64
>   libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
>     -lgdbm_compat
>   perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
>   libc=/lib/libc-2.19.so, so=so, useshrplib=true,
>     libperl=libperl.so.5.18.2
>   gnulibc_version='2.19'
> Dynamic Linking:
>   dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
>   cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1
>     -Wl,--as-needed'
>
>
> Characteristics of this binary (from libperl):
>   Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
>     PERL_DONT_CREATE_GVSV
>     PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
>     PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
>     PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
>     USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
>     USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
>     USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
>     USE_REENTRANT_API
>   Locally applied patches:
>     gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
>       cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
>     gentoo/EUMM_delete_packlist - Don't install .packlist or
>       perllocal.pod for perl or vendor
>     gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
>     gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
>       for modules installed from CPAN.
>     gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
>       directories by default.
>     gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set
>       libperl soname
>     gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't
>       force -fstack-protector on everyone.
>     gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
>       @INC directories.
>     gentoo/mod_paths - Add /etc/perl to @INC
>     gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in
>       patchlevel.h
>     gentoo/aix_soname - aix gcc detection and shared library soname
>       support
>     gentoo/opensolars_headers - Add headers for opensolaris
>     gentoo/cleanup-paths - Cleanup PATH and shrpenv
>     gentoo/usr_local - Remove /usr/local paths
>     gentoo/hints_hpux - Fix hpux hints
>     gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC
>       to link
>     gentoo/interix - Fix interix hints
>     fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
>       'Port' option
>     debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
>       nonexisting site dirs if a parent is writable
>     fixes/memoize_storable_nstore - [rt.cpan.org #77790]
>       Memoize::Storable: respect 'nstore' option not respected
>     fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope
>       gracefully with a failed command
>     fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look
>       up the list of local patches at run time
>     fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576]
>       untaint version, if needed, in Module::Metadata
>     fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of
>       IPC_CREAT in IPC-SysV documentation
>     fixes/freemint -
>   Built under linux
>   Compiled at Oct 29 2014 20:59:02
>   @INC:
>     /etc/perl
>     /usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
>     /usr/local/lib64/perl5/5.18.2
>     /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
>     /usr/lib64/perl5/vendor_perl/5.18.2
>     /usr/local/lib64/perl5
>     /usr/lib64/perl5/vendor_perl
>     /usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
>     /usr/lib64/perl5/5.18.2
>     .
>
> ------------------------------
>
> Message: 2
> Date: Mon, 29 Dec 2014 16:40:42 -0500
> From: John D Burger <[email protected]>
> Subject: Re: [Moses-support] Moses tokenizer treats combining
>    diaeresis inconsistently
> To: "[email protected]" <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=utf-8
>
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
>
> If I understand the situation correctly, this would only fix this particular
> example and a few others like it. There are many base+combining grapheme
> clusters in Unicode text which cannot be normalized to a single pre-composed
> character. Vietnamese comes to mind.
>
> - JB
>
> ------------------------------
>
> Message: 3
> Date: Tue, 30 Dec 2014 10:54:18 +0700
> From: Tom Hoar <[email protected]>
> Subject: Re: [Moses-support] "'" in tokenization
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
>
> The escaping is necessary because Moses reserves these characters for other
> uses. When corpora are consistently prepared, the escaping has no effect on
> translation results. It looks like you have not prepared your corpora
> consistently. Note my results ('s) are different from yours (' s):
>
> user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
> Tokenizer Version 1.1
> Language: en
> Number of threads: 1
> keep your notification 's payload under 5 kb .
>
> Go back and double-check how you prepare your training corpus and your
> translation jobs.
>
>
> On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>> Dears,
>>
>> When I make tokenization on files it replaces the apostrophes with
>> ?'? which make sense, but in the other side it crashes the
>> meaning and the order of the words at all, for example:
>>
>> Sentence before tokenization :
>>
>> Src : keep your notification's payload under 5 kb.
>>
>> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>>
>> Sentence after tokenization :
>>
>> Src: keep your notification ' s payload under 5 kb .
>>
>> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>>
>> If I translate ?keep? without using tokenization it will generates
>> ?????? which Is correct but after using tokenization moses generates
>> ????????? which means that the alignment is crashed
>>
>> do I make something wrong?
>>
>> do I miss something or just it is a natural behavior when I use
>> tokenization
>>
>> Thanks
>>
>> Best Regards
>>
>> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
>> Fax +20233032036
>
> ------------------------------
>
> End of Moses-support Digest, Vol 98, Issue 65
> *********************************************

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
