Thanks, Tom, for your reply.
I think I found where the problem is. When I use the tokenizer.perl script
to tokenize a string, it generates the output you mentioned:
" keep your notification 's payload under 5 kb ."
But if I use the tokenizer.perl script to process a file, the output is:
" keep your notification ' s payload under 5 kb ."
This adds a space between ' and s, which causes some translation problems.
Can you please tell me why this happens?
Thanks

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of [email protected]
Sent: Tuesday, December 30, 2014 5:56 AM
To: [email protected]
Subject: Moses-support Digest, Vol 98, Issue 65

Send Moses-support mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific than
"Re: Contents of Moses-support digest..."


Today's Topics:

   1. Moses tokenizer treats combining diaeresis inconsistently
      (Kenneth Heafield)
   2. Re: Moses tokenizer treats combining diaeresis inconsistently
      (John D Burger)
   3. Re: "&apos;" in tokenization (Tom Hoar)


----------------------------------------------------------------------

Message: 1
Date: Mon, 29 Dec 2014 16:05:51 -0500
From: Kenneth Heafield <[email protected]>
Subject: [Moses-support] Moses tokenizer treats combining diaeresis
        inconsistently
To: "[email protected]" <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

Dear Moses,

        The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
, tokenizes differently on different machines.

        I'm running tokenizer.perl from head (481a07dc) with this perl:

This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi (with 25 registered patches, see perl -V for more
detail)

perl -V is attached from newer machines.

        The input is "Jürgen" with a specific encoding:

uconv -f utf-8 -x any-name jur

\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}

So the umlaut is encoded as a normal "u" character followed by a combining
diaeresis marker.  This encoding is legal, but it differs from the
single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
DIAERESIS}.

Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}
as a single character and recognizing it as part of the IsAlnum class.
Tokenizing on these machines outputs

Jürgen

Newer machines are treating them separately, recognizing \N{COMBINING
DIAERESIS} as a separate character that is not part of IsAlnum.  The Moses
tokenizer then treats it as something to split off, yielding this
tokenization:

Ju ̈ rgen

I thought it might be locale-related but IsAlnum is supposed to be
locale-agnostic.  I couldn't come up with environment variables that made
the new machines tokenize as a single word.

Maybe this is a perl bug, but the result is that two different machines
running the same perl script produce different tokenization :-(.

This is also a reason to turn Unicode normalization on.  If the tokenizer
did NFKC at the beginning, then the problem would go away.
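
As a rough illustration of the normalization idea (written in Python rather
than Perl, and not taken from tokenizer.perl), the standard unicodedata module
can show both forms of the example word. NFC is used for the composition step
here; NFKC composes in the same way and additionally folds compatibility
characters.

```python
import unicodedata

# "Jürgen" with the umlaut written as "u" + COMBINING DIAERESIS (U+0308),
# matching the encoding described for the attached file.
decomposed = "Ju\u0308rgen"

# NFC composes base + combining mark into a precomposed character where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print([unicodedata.name(ch) for ch in composed[:2]])

# On its own, the combining mark is a nonspacing mark (category "Mn"), not a
# letter or digit, so a tokenizer that splits on non-alphanumerics may cut the
# word apart if it sees the mark as a separate character.
print(unicodedata.category("\u0308"), "\u0308".isalnum())
```

After normalization the word contains no standalone combining mark, so the
alphanumeric test no longer has anything to split off.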

Kenneth

-------------- next part --------------
A non-text attachment was scrubbed...
Name: jur.gz
Type: application/gzip
Size: 33 bytes
Desc: not available
Url :
http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9ce44a08/attachment-0001.bin
-------------- next part --------------
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
    uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64
intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
    config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread
-Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
-Dprivlib=/usr/lib64/perl5/5.18.2
-Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
-Dsitelib=/usr/local/lib64/perl5/5.18.2
-Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
-Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2
-Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2
-Dlocincpth=/usr/include  -Dglibpth=/lib64 /usr/lib64  -Duselargefiles
-Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
-Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db
-Dusethreads -DDEBUGGING=none
-Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0
5.18.1/x86_64-linux-thread-multi 5.18.1  -Dlibpth=/usr/local/lib64 /lib64
/usr/lib64 -Dnoextensions=ODBM_File'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -march=native -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
    ccversion='', gccversion='4.7.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
-lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.19.so, so=so, useshrplib=true,
libperl=libperl.so.5.18.2
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1
-Wl,--as-needed'


Characteristics of this binary (from libperl): 
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
                        PERL_DONT_CREATE_GVSV
                        PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
                        USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                        USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
                        USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
                        USE_REENTRANT_API
  Locally applied patches:
        gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
        gentoo/EUMM_delete_packlist - Don't install .packlist or
perllocal.pod for perl or vendor
        gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
        gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
for modules installed from CPAN.
        gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
directories by default.
        gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set
libperl soname
        gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't
force -fstack-protector on everyone.
        gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
@INC directories.
        gentoo/mod_paths - Add /etc/perl to @INC
        gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in
patchlevel.h
        gentoo/aix_soname - aix gcc detection and shared library soname
support
        gentoo/opensolars_headers - Add headers for opensolaris
        gentoo/cleanup-paths - Cleanup PATH and shrpenv
        gentoo/usr_local - Remove /usr/local paths
        gentoo/hints_hpux - Fix hpux hints
        gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC
to link
        gentoo/interix - Fix interix hints
        fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
'Port' option
        debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
nonexisting site dirs if a parent is writable
        fixes/memoize_storable_nstore - [rt.cpan.org #77790]
Memoize::Storable: respect 'nstore' option not respected
        fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope
gracefully with a failed command
        fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look
up the list of local patches at run time
        fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576]
untaint version, if needed, in Module::Metadata
        fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of
IPC_CREAT in IPC-SysV documentation
        fixes/freemint -
  Built under linux
  Compiled at Oct 29 2014 20:59:02
  @INC:
    /etc/perl
    /usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
    /usr/local/lib64/perl5/5.18.2
    /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.18.2
    /usr/local/lib64/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
    /usr/lib64/perl5/5.18.2
    .

------------------------------

Message: 2
Date: Mon, 29 Dec 2014 16:40:42 -0500
From: John D Burger <[email protected]>
Subject: Re: [Moses-support] Moses tokenizer treats combining
        diaeresis       inconsistently
To: "[email protected]" <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8

> This is also a reason to turn Unicode normalization on.  If the 
> tokenizer did NFKC at the beginning, then the problem would go away.

If I understand the situation correctly, this would only fix this particular
example and a few others like it. There are many base+combining grapheme
clusters in Unicode text which cannot be normalized to a single pre-composed
character. Vietnamese comes to mind.
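
A small Python sketch of this point, using only the standard unicodedata
module. The sequence "q" + COMBINING DIAERESIS is an arbitrary example chosen
here (it does not come from the thread) of a cluster with no precomposed form,
so normalization leaves it as two code points. The clusters() helper is
hypothetical: one possible way to keep combining marks attached to their base
character whether or not normalization can compose them.

```python
import unicodedata

# "u" + COMBINING DIAERESIS has a precomposed form; "q" + COMBINING DIAERESIS does not.
for cluster in ("u\u0308", "q\u0308"):
    nfc = unicodedata.normalize("NFC", cluster)
    print(repr(cluster), "->", len(nfc), "code point(s) after NFC")


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration; not part of the Moses tokenizer.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch  # attach the mark to the preceding base character
        else:
            groups.append(ch)
    return groups


print(clusters("Ju\u0308rgen"))
```

Grouping by combining class handles the simple base-plus-marks case only; full
grapheme cluster segmentation has more rules than this sketch covers.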

- JB

On Dec 29, 2014, at 16:05 , Kenneth Heafield <[email protected]> wrote:

> Dear Moses,
> 
>       The attached file, taken from line 2345157 of 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.
> en.shuffled.gz , tokenizes differently on different machines.
> 
>       I'm running tokenizer.perl from head (481a07dc) with this perl:
> 
> This is perl 5, version 18, subversion 2 (v5.18.2) built for 
> x86_64-linux-thread-multi (with 25 registered patches, see perl -V for 
> more detail)
> 
> perl -V is attached from newer machines.
> 
>       The input is "Jürgen" with a specific encoding:
> 
> uconv -f utf-8 -x any-name jur
> 
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING 
> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN 
> SMALL LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
> 
> So the umlaut is encoded as a normal "u" character followed by a 
> combining diaeresis marker.  This encoding is legal, but it differs 
> from the single-character canonical encoding of \N{LATIN SMALL LETTER 
> U WITH DIAERESIS}.
> 
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING 
> DIAERESIS} as a single character and recognizing it as part of the 
> IsAlnum class.  Tokenizing on these machines outputs
> 
> Jürgen
> 
> Newer machines are treating them separately, recognizing \N{COMBINING 
> DIAERESIS} as a separate character that is not part of IsAlnum.  The 
> Moses tokenizer then treats it as something to split off, yielding 
> this
> tokenization:
> 
> Ju ̈ rgen
> 
> I thought it might be locale-related but IsAlnum is supposed to be 
> locale-agnostic.  I couldn't come up with environment variables that 
> made the new machines tokenize as a single word.
> 
> Maybe this is a perl bug, but the result is that two different 
> machines running the same perl script produce different tokenization :-(.
> 
> This is also a reason to turn Unicode normalization on.  If the 
> tokenizer did NFKC at the beginning, then the problem would go away.
> 
> Kenneth
> 
> <jur.gz><perl_V.txt>_______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support




------------------------------

Message: 3
Date: Tue, 30 Dec 2014 10:54:18 +0700
From: Tom Hoar <[email protected]>
Subject: Re: [Moses-support] "&apos;" in tokenization
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"


The escaping is necessary because Moses reserves these characters for other
uses. When corpora are consistently prepared, the escaping has no effect on
translation results. It looks like you have not prepared your corpora
consistently. Note my results (&apos;s) are different from yours (&apos; s):

user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
keep your notification &apos;s payload under 5 kb .

Go back and double-check how you prepare your training corpus and your
translation jobs.
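
One way to run that double-check, sketched in Python under the assumption that
the tokenized corpus is available as plain text lines. The function name and
the sample lines are hypothetical; the two patterns are simply the two outputs
seen in this thread.

```python
import re
from collections import Counter

# The two tokenizations seen in this thread.
ATTACHED = re.compile(r"&apos;s\b")  # e.g. "notification &apos;s"
SPLIT = re.compile(r"&apos; s\b")    # e.g. "notification &apos; s"


def count_apostrophe_styles(lines):
    """Count how often each style occurs in an iterable of tokenized lines."""
    counts = Counter()
    for line in lines:
        counts["attached"] += len(ATTACHED.findall(line))
        counts["split"] += len(SPLIT.findall(line))
    return counts


# Hypothetical sample: one line in each style.
sample = [
    "keep your notification &apos;s payload under 5 kb .",
    "keep your notification &apos; s payload under 5 kb .",
]
print(count_apostrophe_styles(sample))
```

If the training corpus shows mostly one style and the translation input shows
the other, the two were probably tokenized with different settings.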


On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>
> Dears,
>
> When I tokenize files, it replaces the apostrophes with 
> "&apos;", which makes sense, but on the other hand it breaks the 
> meaning and the order of the words entirely, for example:
>
> Sentence before tokenization :
>
> Src : keep your notification's payload under 5 kb.
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> Sentence after tokenization :
>
> Src: keep your notification &apos; s payload under 5 kb .
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> If I translate "keep" without using tokenization, it generates 
> ?????? which is correct, but after using tokenization Moses generates 
> ????????? which means that the alignment is broken.
>
> Am I doing something wrong?
>
> Am I missing something, or is this just natural behavior when I use 
> tokenization?
>
> Thanks
>
> Best Regards
>
> /Ihab Ramadan/| Senior Developer|Saudisoft <http://www.saudisoft.com/> 
> - Egypt| *Tel * +2 02 330 320 37 Ext- 0| Mob+201007570826 | 
> Fax+20233032036
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support


------------------------------



End of Moses-support Digest, Vol 98, Issue 65
*********************************************

