Dear Moses,

        The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
, tokenizes differently on different machines.

        I'm running tokenizer.perl from head (481a07dc) with this perl:

This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi
(with 25 registered patches, see perl -V for more detail)

The output of perl -V from the newer machines is attached.

        The input is "Jürgen" with a specific encoding:

uconv -f utf-8 -x any-name jur

\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}

So the umlaut is encoded as a plain "u" followed by a combining
diaeresis.  This sequence is legal Unicode, but it differs from the
precomposed single-character form, \N{LATIN SMALL LETTER U WITH
DIAERESIS}.
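
For anyone who wants to check the two forms without uconv, here is a minimal
sketch in Python (not part of Moses; the string literals are my own
reconstruction of the attached line, minus the trailing newline).  It lists
the code points of the decomposed form and shows that it is a different
string from the precomposed one.

```python
import unicodedata

# Decomposed form, as in the attached file: "u" followed by U+0308.
decomposed = "Ju\u0308rgen"
# Precomposed form: a single U+00FC code point for the umlaut.
precomposed = "J\u00fcrgen"

# Print each code point with its Unicode name, similar to uconv -x any-name.
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two strings render the same but differ in length and content.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)
```

The decomposed string has one more code point than the precomposed one, so
any per-character test sees the combining mark on its own.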

Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS} as a single character and recognizing it as part of the
IsAlnum class.  Tokenizing on these machines outputs

Jürgen

Newer machines are treating them separately, recognizing \N{COMBINING
DIAERESIS} as a separate character that is not part of IsAlnum.  The
Moses tokenizer then treats it as something to split off, yielding this
tokenization:

Ju ̈ rgen
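
The behaviour on the newer machines can be imitated with a rough sketch in
Python.  This is a simplified stand-in, not the actual tokenizer.perl logic:
split_non_alnum is a hypothetical helper, and Python's str.isalnum() is only
an approximation of Perl's IsAlnum class.  It pads every character that is
neither alphanumeric nor whitespace with spaces.

```python
import re


def split_non_alnum(text: str) -> str:
    """Pad every non-alphanumeric, non-space character with spaces.

    Hypothetical, simplified stand-in for the tokenizer's splitting step.
    """
    out = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            out.append(ch)
        else:
            out.append(f" {ch} ")
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" +", " ", "".join(out)).strip()


# U+0308 (COMBINING DIAERESIS) is a mark, not a letter or digit.
print("\u0308".isalnum())
# The decomposed input is split around the combining mark.
print(split_non_alnum("Ju\u0308rgen"))
# The precomposed input stays in one piece.
print(split_non_alnum("J\u00fcrgen"))
```

With the decomposed input the combining mark is split off as its own token;
with the precomposed input the word survives intact.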

I thought it might be locale-related but IsAlnum is supposed to be
locale-agnostic.  I couldn't come up with environment variables that
made the new machines tokenize as a single word.

Maybe this is a perl bug, but the result is that two different machines
running the same perl script produce different tokenization :-(.

This is also a reason to turn Unicode normalization on.  If the
tokenizer did NFKC at the beginning, then the problem would go away.
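
To illustrate what I mean, here is a minimal sketch in Python rather than a
patch to tokenizer.perl.  normalize_line is a hypothetical helper name; the
only library call is unicodedata.normalize from the standard library.  After
normalization the combining sequence collapses into the single precomposed
character, so every character in the word is alphanumeric.

```python
import unicodedata


def normalize_line(line: str, form: str = "NFKC") -> str:
    """Return the line in the given Unicode normalization form.

    Hypothetical helper; the default form follows the suggestion above.
    """
    return unicodedata.normalize(form, line)


# The attached line: "u" + COMBINING DIAERESIS, then a newline.
decomposed = "Ju\u0308rgen\n"
normalized = normalize_line(decomposed)

# The combining pair is now the single code point U+00FC.
print([f"U+{ord(ch):04X}" for ch in normalized])
# Every non-whitespace character is alphanumeric after normalization.
print(all(ch.isalnum() for ch in normalized.strip()))
```

For this particular input NFC gives the same result as NFKC.  NFKC
additionally folds compatibility characters such as ligatures, so it changes
more of the text than NFC does; which form is appropriate for the tokenizer
is a separate decision.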

Kenneth

Attachment: jur.gz
Description: application/gzip

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
    uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64 
intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
    config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread 
-Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe 
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr 
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin 
-Dprivlib=/usr/lib64/perl5/5.18.2 
-Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi 
-Dsitelib=/usr/local/lib64/perl5/5.18.2 
-Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi 
-Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2 
-Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi 
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2 -Dlocincpth=/usr/include  
-Dglibpth=/lib64 /usr/lib64  -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo 
-Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh 
-Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none 
-Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0 
5.18.1/x86_64-linux-thread-multi 5.18.1  -Dlibpth=/usr/local/lib64 /lib64 
/usr/lib64 -Dnoextensions=ODBM_File'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE 
-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -march=native -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
    ccversion='', gccversion='4.7.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.18.2
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1 
-Wl,--as-needed'


Characteristics of this binary (from libperl): 
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
                        PERL_DONT_CREATE_GVSV
                        PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
                        USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                        USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
                        USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
                        USE_REENTRANT_API
  Locally applied patches:
        gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054 
cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
        gentoo/EUMM_delete_packlist - Don't install .packlist or perllocal.pod 
for perl or vendor
        gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
        gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for 
modules installed from CPAN.
        gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site 
directories by default.
        gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set 
libperl soname
        gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't 
force -fstack-protector on everyone.
        gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing @INC 
directories.
        gentoo/mod_paths - Add /etc/perl to @INC
        gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in 
patchlevel.h
        gentoo/aix_soname - aix gcc detection and shared library soname support
        gentoo/opensolars_headers - Add headers for opensolaris
        gentoo/cleanup-paths - Cleanup PATH and shrpenv
        gentoo/usr_local - Remove /usr/local paths
        gentoo/hints_hpux - Fix hpux hints
        gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC to 
link
        gentoo/interix - Fix interix hints
        fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP 
'Port' option
        debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with 
nonexisting site dirs if a parent is writable
        fixes/memoize_storable_nstore - [rt.cpan.org #77790] Memoize::Storable: 
respect 'nstore' option not respected
        fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope 
gracefully with a failed command
        fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look up 
the list of local patches at run time
        fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576] 
untaint version, if needed, in Module::Metadata
        fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of 
IPC_CREAT in IPC-SysV documentation
        fixes/freemint -
  Built under linux
  Compiled at Oct 29 2014 20:59:02
  @INC:
    /etc/perl
    /usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
    /usr/local/lib64/perl5/5.18.2
    /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.18.2
    /usr/local/lib64/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
    /usr/lib64/perl5/5.18.2
    .
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
