Re: [Moses-support] Tokenization problem

Ihab Ramadan Thu, 15 Jan 2015 00:09:42 -0800

Many thanks to all of you.
As you mentioned, the problem was not in the script; it was in the text sent
to the terminal from my web app. I found that some characters are not passed
through as they are but arrive as weird Unicode.
Thanks, everybody.
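
For anyone hitting the same symptom, here is a minimal Python sketch of one
way to catch such characters before the text reaches the tokenizer. The
thread never says which character was at fault, so the look-alike table
below is an assumption (a typographic apostrophe such as U+2019 is a common
culprit), and both function names are hypothetical helpers, not part of
Moses.

```python
import unicodedata

# Assumed set of apostrophe look-alikes; extend it for your own data.
APOSTROPHE_LOOKALIKES = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\u00b4": "'",  # ACUTE ACCENT
}


def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe look-alikes with the ASCII apostrophe U+0027."""
    return "".join(APOSTROPHE_LOOKALIKES.get(ch, ch) for ch in text)


def report_non_ascii(text: str) -> list:
    """Return (position, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    sample = "your printer\u2019s wireless connection."
    print(report_non_ascii(sample))
    print(normalize_apostrophes(sample))
```

Running the report on both the training text and the text coming from the
web app shows whether the two inputs really contain the same characters.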


-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of [email protected]
Sent: Thursday, January 15, 2015 3:39 AM
To: [email protected]
Subject: Moses-support Digest, Vol 99, Issue 28



Today's Topics:

   1. how to align some new parallel sentences using a  trained
      model (iamzcy_hit iamzcy_hit)
   2. Re: Tokenization problem (Tom Hoar)
   3. Re: Tokenization problem (Kenneth Heafield)


----------------------------------------------------------------------

Message: 1
Date: Thu, 15 Jan 2015 08:54:06 +0800
From: iamzcy_hit iamzcy_hit <[email protected]>
Subject: [Moses-support] how to align some new parallel sentences
        using a trained model
To: "[email protected]" <[email protected]>
Message-ID:
        <CAGLowvLWHXb_J+=vZqMeOVCOD7Z=Uzyz_Sn=yjv+ptsfsyv...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi all,
      If I have trained an alignment model on a huge parallel corpus with
the help of GIZA++, MGIZA or fast_align, and I am now given some new
sentence pairs and want to align the words in them, how should I do it?
      Best regards


------------------------------

Message: 2
Date: Thu, 15 Jan 2015 08:33:17 +0700
From: Tom Hoar <[email protected]>
Subject: Re: [Moses-support] Tokenization problem
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="windows-1252"

I just ran the same sentence through the newest github clone (today).

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$ ./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .

This is not a Perl script problem. What shell and command line are you using
for your "in the file" results? You'll find the problem in either your shell
or your custom tool chain(s) before you run tokenizer.perl.
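
One way to take the shell out of the picture when calling the tokenizer from
an application is to pipe explicit UTF-8 bytes to it directly. The following
is a minimal Python sketch under stated assumptions: the TOKENIZER path is
hypothetical, the function names are illustrative, and the flags are simply
the ones used in the session above.

```python
import subprocess

# Hypothetical path; point this at your own mosesdecoder checkout.
TOKENIZER = "scripts/tokenizer/tokenizer.perl"


def build_command(lang: str = "en", tokenizer: str = TOKENIZER) -> list:
    """Argument list mirroring the flags used in the session above."""
    return ["perl", tokenizer, "-no-escape", "-q", "-l", lang]


def encode_segment(segment: str) -> bytes:
    """Encode one segment as a single UTF-8 line."""
    return (segment.rstrip("\n") + "\n").encode("utf-8")


def tokenize(segment: str, lang: str = "en", tokenizer: str = TOKENIZER) -> str:
    """Pipe one segment through the tokenizer without involving a shell."""
    result = subprocess.run(
        build_command(lang, tokenizer),
        input=encode_segment(segment),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8").rstrip("\n")
```

Because the command is passed as a list and the input as bytes, no shell
quoting or terminal encoding sits between the application and the script,
so a file and a single segment take the same path.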



On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
>
> Dears,
>
> I still have this problem. To avoid confusing the decoder I used the 
> "-no-escape" parameter of the tokenizer.perl script, but an extra space 
> is still added after the apostrophe when tokenizing files, whereas when 
> tokenizing a single segment it comes out without the extra space.
>
> For example
>
> In the file
>
> "which will guide you through connecting and configuring your 
> printer's wireless connection." -> "which will guide you through 
> connecting and configuring your printer ' s wireless connection ."
>
> As a segment
>
> "which will guide you through connecting and configuring your 
> printer's wireless connection." -> "which will guide you through 
> connecting and configuring your printer 's wireless connection ."
>
> I wonder why the same script generates two different outputs.
>
> I have no experience in Perl, so I could not find the line of code that 
> behaves differently depending on whether the segment is in a file or is 
> a single segment passed as a parameter to the script.
>
> Please help
>
> *From:*Ihab Ramadan [mailto:[email protected]]
> *Sent:* Monday, January 5, 2015 10:09 AM
> *To:* [email protected]
> *Subject:* Tokenization problem
>
> Dears,
>
> Using the tokenizer on the training files replaces apostrophes 
> with "&apos; s" (with a space), but if I use the same script to tokenize 
> a single sentence it turns apostrophes into "&apos;s" (without a space).
>
> This problem confuses the decoder during translation.
>
> How can I solve this problem?
>
> Thanks
>
> Best Regards
>
> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/>
> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
> Fax +20233032036
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support


------------------------------

Message: 3
Date: Wed, 14 Jan 2015 20:39:14 -0500
From: Kenneth Heafield <[email protected]>
Subject: Re: [Moses-support] Tokenization problem
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252

I'll interject that it is plausible there is some weird Unicode going on
there, and copy-paste on Linux sometimes canonicalizes graphemes.  Whilst I'm
inclined to side with Tom, the only way to sort this out is with the raw
file from Ihab as e.g. a gzipped attachment.
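
Short of exchanging the raw file, the two inputs can be compared character
by character. Below is a minimal Python sketch, assuming the "file" text and
the "segment" text are both available as strings; the function names are
illustrative only, and the NFC check is just one way to spot text that has
been through a normalization step.

```python
import unicodedata


def describe(text: str) -> list:
    """One 'U+XXXX NAME' entry per character, for eyeballing odd code points."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def first_difference(a: str, b: str):
    """Index of the first differing character, or None if the strings match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix: either identical or one is longer.
    return None if len(a) == len(b) else min(len(a), len(b))


def is_nfc(text: str) -> bool:
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    from_file = "printer\u2019s"   # assumed content, for illustration only
    as_segment = "printer's"
    i = first_difference(from_file, as_segment)
    print(i, describe(from_file[i]), describe(as_segment[i]))
```

Two strings that look identical on screen but yield a non-None index here
differ at the code point level, which is the kind of difference the raw file
would reveal.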

Kenneth



------------------------------


End of Moses-support Digest, Vol 99, Issue 28
*********************************************

