Hi Thuan,
In general, to build a translation system you need at least two types of
corpus:
+ monolingual: to train the language model
+ bilingual (parallel): to train the translation model.
You can download the Europarl corpus from the download section of
http://statmt.org/wmt12/translation-task.html to see what such a corpus
looks like. Normally, the monolingual corpus is a single file in which each
line is one sentence in the target language. The bilingual corpus is a pair
of files, one per language: line N of one file is a sentence in that
language, and line N of the other file is its translation.
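As a rough illustration of that layout, here is a minimal Python sketch that
loads a bilingual corpus and checks that the two files are aligned line by
line. The file names (corpus.en, corpus.vi) and the function names are only
placeholders I made up for the example; they are not part of Moses.

```python
import tempfile
from pathlib import Path


def read_sentences(path):
    """Return the sentences of a corpus file, one sentence per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel_corpus(source_path, target_path):
    """Pair line N of the source file with line N of the target file.

    Raises ValueError when the two files have different line counts,
    because the pairs would then no longer be translations of each other.
    """
    source = read_sentences(source_path)
    target = read_sentences(target_path)
    if len(source) != len(target):
        raise ValueError(
            f"line counts differ: {len(source)} source vs {len(target)} target"
        )
    return list(zip(source, target))


if __name__ == "__main__":
    # Hypothetical two-sentence corpus, written to a temporary directory.
    workdir = Path(tempfile.mkdtemp())
    english = workdir / "corpus.en"
    vietnamese = workdir / "corpus.vi"
    english.write_text("Hello.\nI am a student.\n", encoding="utf-8")
    vietnamese.write_text("Xin chào.\nTôi là sinh viên.\n", encoding="utf-8")

    for src, tgt in load_parallel_corpus(english, vietnamese):
        print(f"{src} ||| {tgt}")
```

A check like this is worth running before training, since a single missing
line shifts every sentence pair that follows it.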
For a Vietnamese corpus, you can send a request to the VLSP project for their
data, contact the Lac Viet company to buy one (I think they have a good
corpus, edited by language experts), or build your own corpus.
On Thu, Aug 9, 2012 at 4:41 PM, <[email protected]> wrote:
> Today's Topics:
>
> 1. MOSES (thuan pham)
> 2. Code monkey available (thuan pham)
> 3. Re: MOSES (Nisheeth Joshi)
> 4. Factored decoding performance (Michal Kraj?ansk?)
> 5. initialize $_EXTERNAL_BINDIR (Rahma Sellami)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 9 Aug 2012 15:08:17 +0700
> From: thuan pham <[email protected]>
> Subject: [Moses-support] MOSES
> To: [email protected]
> Message-ID:
> <CAPwSibwK6njTerh79OsY-9VX=+
> [email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> I need everyone's help.
> I am studying Moses,
> but so far I have not managed to install it,
> and I still cannot picture what a Moses corpus looks like.
> If possible, could you share part of a corpus with me and help me
> install Moses on Windows 7?
>
>
> --
> Phạm Thuận
> University : Đà Nẵng
> Class : K22
> Major : Computer Science
> Phone : 0935854457
> Address : 174 Nguyễn Tri Phương - Đà Nẵng City
>
> ------------------------------
>
> Message: 2
> Date: Thu, 9 Aug 2012 15:36:25 +0700
> From: thuan pham <[email protected]>
> Subject: [Moses-support] Code monkey available
> To: [email protected]
> Message-ID:
> <
> capwsibxscvnscmacs7kz_wd2eu9a0hp2oge8sy4vqgk0kw7...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> ---------- Forwarded message ----------
> From: thuan pham <[email protected]>
> Date: 15:08, 9 August 2012
> Subject: MOSES
> To: [email protected]
>
>
> I need everyone's help.
> I am studying Moses,
> but so far I have not managed to install it,
> and I still cannot picture what a Moses corpus looks like.
> If possible, could you share part of a corpus with me and help me
> install Moses on Windows 7?
>
>
> --
> Phạm Thuận
> University : Đà Nẵng
> Class : K22
> Major : Computer Science
> Phone : 0935854457
> Address : 174 Nguyễn Tri Phương - Đà Nẵng City
>
> ------------------------------
>
> Message: 3
> Date: 9 Aug 2012 09:16:21 -0000
> From: "Nisheeth Joshi" <[email protected]>
> Subject: Re: [Moses-support] MOSES
> To: "thuan pham " <[email protected]>
> Cc: moses-support <[email protected]>
> Message-ID:
>
> <1344500051.S.6854.17716.H.WXRodWFuIHBoYW0AW01vc2VzLXN1cHBvcnRdIE1PU0VT.RU.rfs248,
> rfs248, 135,
> [email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Hi
>
> 1. You can install Moses by following
> http://www.statmt.org/moses/?n=Development.GetStarted (this covers
> installation on Linux).
> 2. For installation on Windows 7, you can refer to
> http://www.statmt.org/moses/uploads/Development/moses-windows.pdf
> 3. You can build a phrase-based system using the tutorial available on
> the Moses website: http://www.statmt.org/moses/?n=Moses.Baseline
> 4. The tutorial page itself points to the corpus required for training.
>
> As for a Vietnamese corpus, I am sorry, but I am not aware of any freely
> available one.
>
> All the best in your endeavours and please next time use English, so that
> more people would be able to help you. Best then what I have.
>
> Nisheeth
>
> From: thuan pham <[email protected]>
> Sent: Thu, 09 Aug 2012 13:44:11
> To: [email protected]
> Subject: [Moses-support] MOSES
> I need everyone's help. I am studying Moses, but so far I have not managed
> to install it, and I still cannot picture what a Moses corpus looks like.
> If possible, could you share part of a corpus with me and help me install
> Moses on Windows 7?
> --
> Phạm Thuận
> University : Đà Nẵng
> Class : K22
> Major : Computer Science
> Phone : 0935854457
> Address : 174 Nguyễn Tri Phương - Đà Nẵng City
>
>
>
> _______________________________________________
>
> Moses-support mailing list
>
> [email protected]
>
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> ------------------------------
>
> Message: 4
> Date: Thu, 9 Aug 2012 11:35:17 +0200
> From: Michal Kraj?ansk? <[email protected]>
> Subject: [Moses-support] Factored decoding performance
> To: [email protected]
> Message-ID:
> <CAHRYFoO37bc56SZkwS=
> [email protected]>
> Content-Type: text/plain; charset="iso-8859-2"
>
> Hi,
>
> I am experimenting with factored training and I've got a question about
> decoding performance.
> I experience hangups - probably not real hangups, just too long computation
> - while decoding some sentences.
>
> My models are trained on text annotated in the following way:
>
> SURF|LEMM|POS|OTHER
>
> Training is done with the following parameters:
>
> --lm 0:3:news-commentary-v7.cs-en.blm.en:8 \
> --lm 2:3:pos.blm.en:8 \
> --translation-factors 1-1+3-2+0-0,2 \
> --generation-factors 1-2+1,2-0 \
> --decoding-steps t0,g0,t1,g1:t2
>
> I suspect some things that can cause the issue:
>
>
> - the POS tags are in the form of a number: 1-10, when ambiguous, the
> possibilities are separated by comma, so e.g. POS=
> 1
> 3,5
> 7,0,2,6,1
>
> POS tags are more ambiguous on the target-language side (en), where this
> affects about half of the cases.
> I see that this can cause a sparsity problem, but I believe it is not a
> fundamental issue.
>
>
> - sometimes I do not get the values for some factors, so I introduce a
> universal placeholder '_' which I put in place of unknown factors
>
> I imagine this could cause some "sink" problem, in the sense of a token that
> is too common (the placeholder is not really uncommon in my data).
>
> I post a sample sentence:
> the|the|3|Node government|government|1|Gsub in|in,inly|7,2,6,1|Node
> Washington|Washington|1|Gsub has|have|5|Node published|publish|5|Pred
> a|a|3|Node prohibition|prohibition|1|Obje to|to|6|Advi that|that|8|Node
> effect|effect|5,1|Node _-_|_|_|Node thereby|thereby|6|Node
> definitively|definitively|6|Node scrapping|scrap|5|Node
> earlier|earlier,early|2,6|Node plans|plan|1|Gsub
>
>
> Could you please point me to the possible problems of this setup?
>
> Thanks in advance and regards,
>
> Michal Kraj?ansk?
>
> ------------------------------
>
> Message: 5
> Date: Thu, 9 Aug 2012 11:41:12 +0200
> From: Rahma Sellami <[email protected]>
> Subject: [Moses-support] initialize $_EXTERNAL_BINDIR
> To: [email protected]
> Message-ID:
> <CADEYUX+P37DFy9=
> [email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Everyone,
>
> As I am new to Moses, I would very much appreciate help with the
> following issue.
> I am trying to use the latest version of the Moses decoder.
> When I run the script train-model.perl, I get the following errors.
>
> Use of uninitialized value $_EXTERNAL_BINDIR in concatenation (.) or string
> at train-model.perl line 213.
> Use of uninitialized value $_EXTERNAL_BINDIR in concatenation (.) or string
> at train-model.perl line 214.
> Use of uninitialized value $_EXTERNAL_BINDIR in concatenation (.) or string
> at train-model.perl line 221.
> Use of uninitialized value $_EXTERNAL_BINDIR in concatenation (.) or string
> at train-model.perl line 222.
> Use of uninitialized value $_EXTERNAL_BINDIR in concatenation (.) or string
> at train-model.perl line 224.
> Using single-thread GIZA
> Use of uninitialized value $_EXTERNAL_BINDIR in concatenation (.) or string
> at train-model.perl line 306.
> ERROR: Cannot find mkcls, GIZA++/mgiza, & snt2cooc.out/snt2cooc in .
> You MUST specify the parameter -external-bin-dir at train-model.perl line
> 306.
>
> How can I initialize $_EXTERNAL_BINDIR?
>
> Thanks.
>
>
> --
>
> RAHMA Sellami
> PhD Computer Science Student
> http://sites.google.com/site/rahmasellami/
> Faculty of Economic Sciences and management of Sfax
> ANLP Research Group
> http://sites.google.com/site/anlprg
>
> MIRACL Laboratory
> www.miracl.rnu.tn
>
> Email: [email protected]
>
> ------------------------------
>
> End of Moses-support Digest, Vol 70, Issue 43
> *********************************************
>
--
Thu.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support