Dear Saeed,

You can do the data selection using IRSTLM; I think it fits your needs. Take a look at the following link:
http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Data_selection
It helps you find the subset of sentences in your large training corpus that best matches your test corpus. Note that it was originally designed for the monolingual scenario, but if you want to filter a parallel corpus, you can do the following:

1. Add line numbers to the beginning of the lines of the source side of your training corpus.
2. Do the data selection as described in the manual.
3. Extract the corresponding translations of the selected source lines.
4. Enjoy life.

Best,
Amin

On Thu, Jan 16, 2014 at 4:43 PM, Saeed Farzi <[email protected]> wrote:
> Dear all,
>
> I am working on a translation task with a very large parallel corpus.
> Because of the computational cost of training on such a corpus, I am
> going to filter it with respect to the test set (of course, after the
> filtering, the evaluation must still be fair).
>
> I am looking for a solution or a tool for filtering parallel corpus
> sentences.
>
> Note that I do not need to filter the phrase table. I know that the
> filter_ moses tool reduces the phrase table size.
>
> cheers
> --
> S.Farzi, Ph.D. Student
> Natural Language Processing Lab,
> School of Electrical and Computer Eng.,
> Tehran University
> Tel: +9821-6111-9719
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
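
P.S. Steps 1 and 3 of the recipe above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the tab separator, and the idea that the selected lines come back with their line-number prefix intact are all hypothetical and not part of IRSTLM. Step 2 (the selection itself) is left to IRSTLM as described in its manual, so here it is replaced by a hand-picked stand-in.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumed separator between the line number and the sentence (hypothetical choice).
SEP = "\t"


def number_lines(source_lines: Iterable[str]) -> List[str]:
    """Step 1: prefix each source sentence with its 1-based line number."""
    return [
        f"{i}{SEP}{line.rstrip(chr(10))}"
        for i, line in enumerate(source_lines, start=1)
    ]


def extract_pairs(
    selected_lines: Iterable[str], target_lines: Sequence[str]
) -> List[Tuple[str, str]]:
    """Step 3: map each selected, numbered source line back to its translation.

    Assumes every selected line still starts with '<number><SEP>'.
    """
    pairs = []
    for line in selected_lines:
        number, _, sentence = line.rstrip("\n").partition(SEP)
        index = int(number) - 1  # line numbers are 1-based
        pairs.append((sentence, target_lines[index].rstrip("\n")))
    return pairs


if __name__ == "__main__":
    source = ["this is a house\n", "good morning\n", "how are you\n"]
    target = ["das ist ein Haus\n", "guten Morgen\n", "wie geht es dir\n"]

    numbered = number_lines(source)
    # Stand-in for step 2: pretend the selection tool kept lines 3 and 1.
    selected = [numbered[2], numbered[0]]

    for src, tgt in extract_pairs(selected, target):
        print(f"{src} ||| {tgt}")
```

Keeping the number as a prefix means the order in which the selection returns the sentences does not matter; each one can still be matched to its target-side line.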
