Precision Translation Tools announces the release of a new open source data cleaning initiative, Corpus Filtergraph CE.
Corpus Filtergraph CE (Community Edition) is a cross-platform Python toolbox for extracting, filtering and aligning parallel language data for statistical machine translation (SMT) systems. Corpus Filtergraph CE enables users to transform existing translation memories and other documents into parallel training corpora and language models for SMT. The filtergraph manager guides documents through a series of processing nodes. The nodes use a common plug-in API to host almost any open-source library, such as sentence splitters and aligners. One example plugin is currently available, and we invite the Moses decoder community to participate and to develop plugins to share with the larger SMT community.

http://www.precisiontranslationtools.com/ptt/tools.html

Best regards,
Tom
Precision Translation Tools Co., Ltd.

On Tue, Mar 23, 2010 at 12:03 PM, <[email protected]> wrote:

Send Moses-support mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

   1. Re: Tool for segmenting the sentences of a bilingual corpus (Achim Ruopp)
   2.
      Re: Tool for segmenting the sentences of a bilingual corpus (John Morgan)

----------------------------------------------------------------------

Message: 1
Date: Mon, 22 Mar 2010 15:09:56 -0400
From: Achim Ruopp <[email protected]>
Subject: Re: [Moses-support] Tool for segmenting the sentences of a bilingual corpus
To: Jörg Tiedemann <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1

Just in case you need a library - I recently packaged the Europarl sentence splitter and sentence aligner tools into two Perl modules on CPAN:

http://search.cpan.org/~achimru/Lingua-Sentence-1.00/
http://search.cpan.org/~achimru/Text-GaleChurch-1.00/

Achim

2010/3/22 Jörg Tiedemann <[email protected]>:
>
> Europarl comes with a sentence aligner:
> http://statmt.org/europarl/v5/tools.tgz
>
> You can also use hunalign:
> http://mokk.bme.hu/resources/hunalign
> (look at the "realign" feature for lexical matching)
> GMA:
> http://nlp.cs.nyu.edu/GMA/
>
> Uplug includes all three and also a tool for interactive
> (semi-automatic) sentence alignment:
> http://sourceforge.net/projects/uplug/
> http://www.let.rug.nl/~tiedeman/Uplug/php/
>
>
> Jörg
>
>
> Raphael Payen wrote:
>> Hi
>>
>> From what I've seen, moses, even with all the tools that go with it,
>> requires a sentence-aligned bilingual corpus as its input. What if we
>> only have an unaligned parallel corpus? Do you know if there are
>> tools available to do this sentence-level alignment? There seems to
>> be something in python-nltk, based on Gale & Church, but it is recent
>> and not yet completely part of the package. Besides, the Gale & Church
>> algorithm uses only sentence lengths; there probably exist more
>> powerful algorithms, using dictionaries of word alignment information?
>> (I mean "static" dictionaries provided beforehand; I guess
>> theoretically there could be ways to "dynamically" use a word aligner
>> like giza on an unaligned corpus, compute some word alignments, use
>> them to compute the sentence alignments, and feed this to itself, but
>> static dictionaries seem more practical).
>>
>> Also, since this step usually requires human supervision, do you know
>> if there are open-source / unix GUI tools to assist in editing
>> the proposed alignments (comparable to Trados WinAlign)?
>>
>> Best regards,
>>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

------------------------------

Message: 2
Date: Mon, 22 Mar 2010 21:40:10 -0400
From: "John Morgan" <[email protected]>
Subject: Re: [Moses-support] Tool for segmenting the sentences of a bilingual corpus
To: "'Jörg Tiedemann'" <[email protected]>, "'Raphael Payen'" <[email protected]>
Cc: [email protected]
Message-ID: <005401caca29$c6c57660$0201a...@surubi>
Content-Type: text/plain; charset=iso-8859-1

I think the bilingual sentence aligner by Bob Moore of Microsoft does what you want.
http://research.microsoft.com/en-us/people/bobmoore/

J

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jörg Tiedemann
Sent: Monday, March 22, 2010 11:36 AM
To: Raphael Payen
Cc: [email protected]
Subject: Re: [Moses-support] Tool for segmenting the sentences of a bilingual corpus

Europarl comes with a sentence aligner:
http://statmt.org/europarl/v5/tools.tgz

You can also use hunalign:
http://mokk.bme.hu/resources/hunalign
(look at the "realign" feature for lexical matching)
GMA:
http://nlp.cs.nyu.edu/GMA/

Uplug includes all three and also a tool for interactive
(semi-automatic) sentence alignment:
http://sourceforge.net/projects/uplug/
http://www.let.rug.nl/~tiedeman/Uplug/php/

Jörg

Raphael Payen wrote:
> Hi
>
> From what I've seen, moses, even with all the tools that go with it,
> requires a sentence-aligned bilingual corpus as its input. What if we
> only have an unaligned parallel corpus? Do you know if there are
> tools available to do this sentence-level alignment? There seems to
> be something in python-nltk, based on Gale & Church, but it is recent
> and not yet completely part of the package. Besides, the Gale & Church
> algorithm uses only sentence lengths; there probably exist more
> powerful algorithms, using dictionaries of word alignment information?
> (I mean "static" dictionaries provided beforehand; I guess
> theoretically there could be ways to "dynamically" use a word aligner
> like giza on an unaligned corpus, compute some word alignments, use
> them to compute the sentence alignments, and feed this to itself, but
> static dictionaries seem more practical).
>
> Also, since this step usually requires human supervision, do you know
> if there are open-source / unix GUI tools to assist in editing
> the proposed alignments (comparable to Trados WinAlign)?
>
> Best regards,
>

------------------------------

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 41, Issue 31
*********************************************
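P.S. For readers wondering what "a manager guiding documents through a series of processing nodes" might look like in practice, here is a minimal Python sketch of the idea from the announcement at the top of this message. It is an illustration only and not the Corpus Filtergraph CE plug-in API: the names `Node`, `split_sentences`, `drop_short` and `FiltergraphManager` are all hypothetical, the sentence splitter is a naive regular expression, and a real node would more likely wrap an existing open-source splitter or aligner.

```python
# Hypothetical sketch: none of these names come from Corpus Filtergraph CE.
import re
from typing import Callable, Iterable, List

# In this sketch a "node" is any callable that maps a list of text
# segments to a new list of text segments.
Node = Callable[[List[str]], List[str]]


def split_sentences(segments: List[str]) -> List[str]:
    """Naive splitter node: break after '.', '!' or '?' followed by whitespace."""
    out: List[str] = []
    for seg in segments:
        parts = re.split(r"(?<=[.!?])\s+", seg.strip())
        out.extend(p for p in parts if p)
    return out


def drop_short(min_tokens: int) -> Node:
    """Build a filter node that drops segments with fewer than min_tokens tokens."""
    def node(segments: List[str]) -> List[str]:
        return [s for s in segments if len(s.split()) >= min_tokens]
    return node


class FiltergraphManager:
    """Runs segments through the configured nodes, one after another."""

    def __init__(self, nodes: Iterable[Node]) -> None:
        self.nodes = list(nodes)

    def run(self, segments: List[str]) -> List[str]:
        for node in self.nodes:
            segments = node(segments)
        return segments


manager = FiltergraphManager([split_sentences, drop_short(2)])
result = manager.run(["Hello world. Ok. This is a test!"])
print(result)  # ['Hello world.', 'This is a test!']
```

Because every node in this sketch shares one call signature, swapping the naive splitter for a wrapper around a real library only means writing another callable with the same shape. That is the kind of benefit a common plug-in interface is meant to give, though the actual toolbox may define its nodes quite differently.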
