Re: [Moses-support] Moses-support Digest, Vol 98, Issue 65

Tom Hoar Wed, 31 Dec 2014 05:10:14 -0800

Three ideas come to mind.

1) Make sure you are setting the `-l en` argument properly. Each
language isolates punctuation differently.
2) Are you sure your file uses UTF-8 character encoding?
3) Maybe the version of Perl you're using is sensitive to other things,
like locale settings. Try setting the environment variable LC_ALL=C in
your terminal.
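
For idea 2, one way to check the encoding is a small Python sketch like the one below. The function name check_utf8 and the extra checks (byte order mark, CRLF line endings, curly apostrophe) are illustrative assumptions, not part of Moses; they are simply things that can differ between a typed string and a file.

```python
import sys
from pathlib import Path


def check_utf8(path):
    """Return (ok, detail) describing whether the file decodes as UTF-8."""
    data = Path(path).read_bytes()
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return False, f"invalid byte at offset {err.start}"
    notes = []
    if data.startswith(b"\xef\xbb\xbf"):
        notes.append("starts with a byte order mark")
    if "\r\n" in text:
        notes.append("contains CRLF line endings")
    if "\u2019" in text:
        notes.append("contains U+2019 (curly apostrophe)")
    return True, "; ".join(notes) or "clean"


if __name__ == "__main__":
    # Usage: python check_utf8.py corpus.en corpus.ar
    for name in sys.argv[1:]:
        print(name, *check_utf8(name))
```

If the file fails to decode, or reports anything other than "clean", that difference is worth ruling out before looking at the tokenizer itself.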


Happy New Year!


On 12/31/2014 07:05 PM, Ihab Ramadan wrote:
> Thanks, Tom, for your reply.
> I think I found where the problem is. When I use the tokenizer.perl script
> to tokenize a string, it generates the output you mentioned:
> " keep your notification &apos;s payload under 5 kb ." But if I use the
> tokenizer.perl script to process a file, the output is
> " keep your notification &apos; s payload under 5 kb ." which adds a space
> between &apos; and s, and this causes some translation problems.
> Can you please tell me why this happens?
> Thanks
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of [email protected]
> Sent: Tuesday, December 30, 2014 5:56 AM
> To: [email protected]
> Subject: Moses-support Digest, Vol 98, Issue 65
>
> Send Moses-support mailing list submissions to
>       [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>       http://mailman.mit.edu/mailman/listinfo/moses-support
> or, via email, send a message with subject or body 'help' to
>       [email protected]
>
> You can reach the person managing the list at
>       [email protected]
>
> When replying, please edit your Subject line so it is more specific than
> "Re: Contents of Moses-support digest..."
>
>
> Today's Topics:
>
>     1. Moses tokenizer treats combining diaeresis inconsistently
>        (Kenneth Heafield)
>     2. Re: Moses tokenizer treats combining diaeresis inconsistently
>        (John D Burger)
>     3. Re: "&apos;" in tokenization (Tom Hoar)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 29 Dec 2014 16:05:51 -0500
> From: Kenneth Heafield <[email protected]>
> Subject: [Moses-support] Moses tokenizer treats combining diaeresis
>       inconsistently
> To: "[email protected]" <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Dear Moses,
>
>       The attached file, taken from line 2345157 of
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> , tokenizes differently on different machines.
>
>       I'm running tokenizer.perl from head (481a07dc) with this perl:
>
> This is perl 5, version 18, subversion 2 (v5.18.2) built for
> x86_64-linux-thread-multi (with 25 registered patches, see perl -V for more
> detail)
>
> perl -V is attached from newer machines.
>
>       The input is "Jürgen" with a specific encoding:
>
> uconv -f utf-8 -x any-name jur
>
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>
> So the umlaut is encoded as a normal "u" character followed by a combining
> diaeresis marker.  This encoding is legal, but it differs from the
> single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
> DIAERESIS}.
>
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}
> as a single character and recognizing it as part of the IsAlnum class.
> Tokenizing on these machines outputs
>
> Jürgen
>
> Newer machines are treating them separately, recognizing \N{COMBINING
> DIAERESIS} as a separate character that is not part of IsAlnum.  The Moses
> tokenizer then treats it as something to split off, yielding this
> tokenization:
>
> Ju ̈ rgen
>
> I thought it might be locale-related but IsAlnum is supposed to be
> locale-agnostic.  I couldn't come up with environment variables that made
> the new machines tokenize as a single word.
>
> Maybe this is a perl bug, but the result is that two different machines
> running the same perl script produce different tokenization :-(.
>
> This is also a reason to turn Unicode normalization on.  If the tokenizer
> did NFKC at the beginning, then the problem would go away.
>
> Kenneth
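
The two spellings described above can be compared with a minimal Python sketch. It uses only the standard unicodedata module. Python's character categories are not Perl's IsAlnum class, so this illustrates the encoding difference and the effect of normalization, not the behavior of tokenizer.perl itself.

```python
import unicodedata

# "Jürgen" spelled with u + COMBINING DIAERESIS, as in the uconv dump above.
decomposed = "Ju\u0308rgen"

# NFC and NFKC both compose the pair into LATIN SMALL LETTER U WITH DIAERESIS.
nfc = unicodedata.normalize("NFC", decomposed)
nfkc = unicodedata.normalize("NFKC", decomposed)


def describe(text):
    """List (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


if __name__ == "__main__":
    for label, text in (("decomposed", decomposed), ("NFC", nfc)):
        print(label, "length", len(text))
        for row in describe(text):
            print("   ", *row)
```

The decomposed string has one more code point than the normalized one, and the extra code point is a nonspacing mark (category Mn) rather than a letter, which is the kind of character a punctuation-splitting rule could separate from its base.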
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: jur.gz
> Type: application/gzip
> Size: 33 bytes
> Desc: not available
> Url :
> http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9ce44a08/attachment-0001.bin
> -------------- next part --------------
> Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
>     
>    Platform:
>      osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
>      uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64
> intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
>      config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread
> -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe
> -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
> -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
> -Dprivlib=/usr/lib64/perl5/5.18.2
> -Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
> -Dsitelib=/usr/local/lib64/perl5/5.18.2
> -Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
> -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2
> -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
> -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
> -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
> -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
> -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2
> -Dlocincpth=/usr/include  -Dglibpth=/lib64 /usr/lib64  -Duselargefiles
> -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
> -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db
> -Dusethreads -DDEBUGGING=none
> -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0
> 5.18.1/x86_64-linux-thread-multi 5.18.1  -Dlibpth=/usr/local/lib64 /lib64
> /usr/lib64 -Dnoextensions=ODBM_File'
>      hint=recommended, useposix=true, d_sigaction=define
>      useithreads=define, usemultiplicity=define
>      useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
>      use64bitint=define, use64bitall=define, uselongdouble=undef
>      usemymalloc=n, bincompat5005=undef
>    Compiler:
>      cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
> -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
>      optimize='-O3 -march=native -pipe',
>      cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
>      ccversion='', gccversion='4.7.3', gccosandvers=''
>      intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
>      d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
>      ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
> lseeksize=8
>      alignbytes=8, prototype=define
>    Linker and Libraries:
>      ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
>      libpth=/usr/local/lib64 /lib64 /usr/lib64
>      libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
> -lgdbm_compat
>      perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
>      libc=/lib/libc-2.19.so, so=so, useshrplib=true,
> libperl=libperl.so.5.18.2
>      gnulibc_version='2.19'
>    Dynamic Linking:
>      dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
>      cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1
> -Wl,--as-needed'
>
>
> Characteristics of this binary (from libperl):
>    Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
>                          PERL_DONT_CREATE_GVSV
>                          PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
>                          PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
>                          PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
>                          USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
>                          USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
>                          USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
>                          USE_REENTRANT_API
>    Locally applied patches:
>       gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
> cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
>       gentoo/EUMM_delete_packlist - Don't install .packlist or
> perllocal.pod for perl or vendor
>       gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
>       gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
> for modules installed from CPAN.
>       gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
> directories by default.
>       gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set
> libperl soname
>       gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't
> force -fstack-protector on everyone.
>       gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
> @INC directories.
>       gentoo/mod_paths - Add /etc/perl to @INC
>       gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in
> patchlevel.h
>       gentoo/aix_soname - aix gcc detection and shared library soname
> support
>       gentoo/opensolars_headers - Add headers for opensolaris
>       gentoo/cleanup-paths - Cleanup PATH and shrpenv
>       gentoo/usr_local - Remove /usr/local paths
>       gentoo/hints_hpux - Fix hpux hints
>       gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC
> to link
>       gentoo/interix - Fix interix hints
>       fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
> 'Port' option
>       debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
> nonexisting site dirs if a parent is writable
>       fixes/memoize_storable_nstore - [rt.cpan.org #77790]
> Memoize::Storable: respect 'nstore' option not respected
>       fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope
> gracefully with a failed command
>       fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look
> up the list of local patches at run time
>       fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576]
> untaint version, if needed, in Module::Metadata
>       fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of
> IPC_CREAT in IPC-SysV documentation
>       fixes/freemint -
>    Built under linux
>    Compiled at Oct 29 2014 20:59:02
>    @INC:
>      /etc/perl
>      /usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
>      /usr/local/lib64/perl5/5.18.2
>      /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
>      /usr/lib64/perl5/vendor_perl/5.18.2
>      /usr/local/lib64/perl5
>      /usr/lib64/perl5/vendor_perl
>      /usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
>      /usr/lib64/perl5/5.18.2
>      .
>
> ------------------------------
>
> Message: 2
> Date: Mon, 29 Dec 2014 16:40:42 -0500
> From: John D Burger <[email protected]>
> Subject: Re: [Moses-support] Moses tokenizer treats combining
>       diaeresis       inconsistently
> To: "[email protected]" <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=utf-8
>
>> This is also a reason to turn Unicode normalization on.  If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
> If I understand the situation correctly, this would only fix this particular
> example and a few others like it. There are many base+combining grapheme
> clusters in Unicode text which cannot be normalized to a single pre-composed
> character. Vietnamese comes to mind.
>
> - JB
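
The point about clusters with no pre-composed form can be illustrated with a short Python sketch. The pair q + COMBINING DIAERESIS is used here only because Unicode has no single code point for it. The helpers is_word_char and split_non_word are hypothetical names for one possible mark-aware alternative, and they are not what tokenizer.perl does.

```python
import unicodedata

# q + COMBINING DIAERESIS has no pre-composed code point,
# so NFC leaves it as two code points.
cluster = "q\u0308"
still_two = unicodedata.normalize("NFC", cluster)


def is_word_char(ch):
    """Treat letters, digits, and combining marks (categories M*) as word characters."""
    return ch.isalnum() or unicodedata.category(ch).startswith("M")


def split_non_word(text):
    """Put spaces around every character that is neither a word character nor whitespace."""
    out = []
    for ch in text:
        if is_word_char(ch) or ch.isspace():
            out.append(ch)
        else:
            out.append(f" {ch} ")
    # Collapse the runs of spaces introduced above.
    return " ".join("".join(out).split())
```

With this approach a combining mark stays attached to its base character whether or not normalization can compose the pair, at the cost of a rule that has to know about mark categories.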
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 30 Dec 2014 10:54:18 +0700
> From: Tom Hoar <[email protected]>
> Subject: Re: [Moses-support] "&apos;" in tokenization
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
>
>
> The escaping is necessary because Moses reserves these characters for other
> uses. When corpora are consistently prepared, the escaping has no effect on
> translation results. It looks like you have not prepared your corpora
> consistently. Note my results (&apos;s) are different from yours (&apos; s):
>
> user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
> Tokenizer Version 1.1
> Language: en
> Number of threads: 1
> keep your notification &apos;s payload under 5 kb .
>
> Go back and double-check how you prepare your training corpus and your
> translation jobs.
>
>
> On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>> Dear all,
>>
>> When I tokenize files, the tokenizer replaces apostrophes with
>> "&apos;", which makes sense, but on the other hand it completely
>> breaks the meaning and the order of the words. For example:
>>
>> Sentence before tokenization:
>>
>> Src: keep your notification's payload under 5 kb.
>>
>> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>>
>> Sentence after tokenization:
>>
>> Src: keep your notification &apos; s payload under 5 kb .
>>
>> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>>
>> If I translate "keep" without tokenization, Moses generates
>> ??????, which is correct, but after tokenization Moses generates
>> ?????????, which means that the alignment is broken.
>>
>> Am I doing something wrong?
>>
>> Am I missing something, or is this just natural behavior when I use
>> tokenization?
>>
>> Thanks
>>
>> Best Regards
>>
>> /Ihab Ramadan/ | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
>> Fax +20233032036
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> ------------------------------
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> End of Moses-support Digest, Vol 98, Issue 65
> *********************************************
>
>

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
