Re: [Moses-support] filter parallel corpus

Amin Farajian Thu, 16 Jan 2014 08:19:06 -0800

Dear Saeed,

You can do the data selection using IRSTLM. I think it fits your needs. Take
a look at the following link:
http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Data_selection

It helps you find the subset of sentences within your large training
corpus that best matches your test corpus.
Note that it was originally designed for the monolingual scenario. However,
if you want to filter a parallel corpus, you can do the following:

1. Add line numbers to the beginning of the lines of the source side of
your training corpus.
2. Do the data selection as described in the manual.
3. Use the line numbers of the selected source lines to extract the
corresponding translations from the target side.
4. Enjoy life
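
As a rough sketch of steps 1 and 3 (step 2 is done by IRSTLM itself and is
not shown here), something like the following Python could work. The function
names are made up for illustration, and it assumes that the selection step
leaves the line-number prefix intact as the first whitespace-separated token
of each selected line. Check the actual output of the selection tool and
adapt the parsing if it differs.

```python
def number_lines(source_lines):
    """Step 1: prefix each source sentence with its 1-based line number."""
    return [f"{i}\t{line}" for i, line in enumerate(source_lines, start=1)]


def selected_ids(selected_lines):
    """Recover line numbers from the selected lines.

    Assumption: the number added in step 1 is still the first
    whitespace-separated token of every selected line.
    """
    ids = []
    for line in selected_lines:
        parts = line.split(None, 1)
        if parts and parts[0].isdigit():
            ids.append(int(parts[0]))
    return ids


def extract_pairs(selected_lines, source_lines, target_lines):
    """Step 3: return (source, target) pairs for the selected line numbers."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target sides must have the same length")
    pairs = []
    for line_no in selected_ids(selected_lines):
        pairs.append((source_lines[line_no - 1], target_lines[line_no - 1]))
    return pairs


# Toy example: pretend the selection step kept sentences 3 and 1.
source = ["a b", "c d", "e f"]
target = ["A B", "C D", "E F"]
numbered = number_lines(source)
selected = [numbered[2], numbered[0]]
pairs = extract_pairs(selected, source, target)
```

For a real corpus you would read and write the files line by line instead of
holding everything in lists, but the bookkeeping stays the same.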

Bests,
Amin

On Thu, Jan 16, 2014 at 4:43 PM, Saeed Farzi <[email protected]> wrote:

> Dear all,
>
> I am working on a translation task with a very large parallel corpus.
> Because of the computational cost of training on such a large corpus, I am
> going to filter it with respect to the test set (of course, after
> filtering, the evaluation must still be fair).
>
> I am looking for a solution or a tool for filtering parallel corpus
> sentences.
>
> Note that I do not need to filter the phrase table. I know that the
> filter_ moses tool reduces the phrase table size.
>
> cheers
> --
>            S.Farzi, Ph.D. Student
>     Natural Language Processing Lab,
>   School of Electrical and Computer Eng.,
>                Tehran University
>              Tel: +9821-6111-9719
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
