Dear Moses,
The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
tokenizes differently on different machines.
I'm running tokenizer.perl from head (481a07dc) with this perl:
This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi
(with 25 registered patches, see perl -V for more detail)
The perl -V output from one of the newer machines is attached.
The input is "Jürgen" with a specific encoding:
uconv -f utf-8 -x any-name jur
\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
So the umlaut is encoded as a plain "u" character followed by a
combining diaeresis. This encoding is legal, but it differs from
the precomposed single-character form \N{LATIN SMALL LETTER U WITH
DIAERESIS}, which is what NFC normalization would produce.
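To make the difference between the two forms concrete, here is a minimal Python sketch (Python rather than Perl, purely for illustration) that lists code point names much like the uconv call above. The helper name describe is made up for this example.

```python
import unicodedata

def describe(text):
    """Return the Unicode name of each code point, similar to uconv -x any-name."""
    # Control characters such as newline have no name, so fall back to U+XXXX.
    return [unicodedata.name(ch, f"<U+{ord(ch):04X}>") for ch in text]

decomposed = "Ju\u0308rgen"   # "u" followed by COMBINING DIAERESIS, as in the attached file
precomposed = "J\u00fcrgen"   # single code point LATIN SMALL LETTER U WITH DIAERESIS

for label, word in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label, len(word), describe(word))
```

Both strings render identically, but they compare unequal and differ in length by one code point.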
Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS} as a single character and recognizing it as part of the
IsAlnum class. Tokenizing on these machines outputs
Jürgen
Newer machines are treating them separately, recognizing \N{COMBINING
DIAERESIS} as a separate character that is not part of IsAlnum. The
Moses tokenizer then treats it as something to split off, yielding this
tokenization:
Ju ̈ rgen
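The splitting behaviour on the newer machines can be imitated with a rough Python sketch. This is not the tokenizer.perl logic, only a stand-in that pads every character that is neither alphanumeric nor whitespace with spaces; the function name split_non_alnum is hypothetical, and Python's str.isalnum is assumed here as an approximation of IsAlnum.

```python
import unicodedata

def split_non_alnum(text):
    """Pad every non-alphanumeric, non-space character with spaces."""
    pieces = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            pieces.append(ch)
        else:
            pieces.append(f" {ch} ")
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(pieces).split())

combining = "\u0308"
# COMBINING DIAERESIS is a nonspacing mark (category Mn), not a letter.
print(unicodedata.category(combining), combining.isalnum())

# ascii() keeps the combining mark visible as an escape sequence.
print(ascii(split_non_alnum("Ju\u0308rgen")))   # decomposed input
print(ascii(split_non_alnum("J\u00fcrgen")))    # precomposed input
```

With the decomposed input the mark is split off into its own token, while the precomposed input stays a single word.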
I thought it might be locale-related but IsAlnum is supposed to be
locale-agnostic. I couldn't come up with environment variables that
made the new machines tokenize as a single word.
Maybe this is a perl bug, but the result is that two different machines
running the same perl script produce different tokenization :-(.
This is also a reason to turn Unicode normalization on. If the
tokenizer did NFKC at the beginning, then the problem would go away.
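As a sketch of what normalizing up front would do to this input, again in Python for illustration only (the helper normalize_line is hypothetical and not part of Moses):

```python
import unicodedata

def normalize_line(line, form="NFKC"):
    """Normalize one input line before any tokenization is attempted."""
    return unicodedata.normalize(form, line)

line = "Ju\u0308rgen\n"            # decomposed form from the attached file
normalized = normalize_line(line)

# After normalization the "u" and the combining mark are one code point.
print([f"U+{ord(ch):04X}" for ch in normalized])
```

NFC alone would already compose this particular sequence. NFKC additionally folds compatibility characters, for example the "ﬁ" ligature (U+FB01) becomes the two letters "fi", which may or may not be wanted for a given corpus.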
Kenneth
jur.gz
Description: application/gzip
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
Platform:
osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64
intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread
-Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
-Dprivlib=/usr/lib64/perl5/5.18.2
-Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
-Dsitelib=/usr/local/lib64/perl5/5.18.2
-Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
-Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2
-Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2 -Dlocincpth=/usr/include
-Dglibpth=/lib64 /usr/lib64 -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo
-Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh
-Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none
-Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0
5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64
/usr/lib64 -Dnoextensions=ODBM_File'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O3 -march=native -pipe',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
ccversion='', gccversion='4.7.3', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
libpth=/usr/local/lib64 /lib64 /usr/lib64
libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.18.2
gnulibc_version='2.19'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1
-Wl,--as-needed'
Characteristics of this binary (from libperl):
Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
USE_REENTRANT_API
Locally applied patches:
gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
gentoo/EUMM_delete_packlist - Don't install .packlist or perllocal.pod
for perl or vendor
gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for
modules installed from CPAN.
gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
directories by default.
gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set
libperl soname
gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't
force -fstack-protector on everyone.
gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing @INC
directories.
gentoo/mod_paths - Add /etc/perl to @INC
gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in
patchlevel.h
gentoo/aix_soname - aix gcc detection and shared library soname support
gentoo/opensolars_headers - Add headers for opensolaris
gentoo/cleanup-paths - Cleanup PATH and shrpenv
gentoo/usr_local - Remove /usr/local paths
gentoo/hints_hpux - Fix hpux hints
gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC to
link
gentoo/interix - Fix interix hints
fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
'Port' option
debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
nonexisting site dirs if a parent is writable
fixes/memoize_storable_nstore - [rt.cpan.org #77790] Memoize::Storable:
respect 'nstore' option not respected
fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope
gracefully with a failed command
fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look up
the list of local patches at run time
fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576]
untaint version, if needed, in Module::Metadata
fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of
IPC_CREAT in IPC-SysV documentation
fixes/freemint -
Built under linux
Compiled at Oct 29 2014 20:59:02
@INC:
/etc/perl
/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
/usr/local/lib64/perl5/5.18.2
/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
/usr/lib64/perl5/vendor_perl/5.18.2
/usr/local/lib64/perl5
/usr/lib64/perl5/vendor_perl
/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
/usr/lib64/perl5/5.18.2
.
