Re: [Moses-support] SMT resources for Indian languages

Rajnath Patel Tue, 25 Nov 2014 05:35:19 -0800

Very useful. Adding some more resources, available at:
http://kbcs.in/tools.html


On Tue, Nov 25, 2014 at 4:33 PM, <[email protected]> wrote:

>
> Today's Topics:
>
>    1. SMT resources for Indian languages (Anoop (?????))
>    2. Re: (no subject) (Hieu Hoang)
>    3. CFP EAMT 2015: 18th Annual Conference of the European
>       Association for Machine Translation (Felipe Sánchez Martínez)
>    4. Re: Too large language models - how to handle that? (Hoang Cuong)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 25 Nov 2014 07:59:46 +0530
> From: Anoop (?????)     <[email protected]>
> Subject: [Moses-support] SMT resources for Indian languages
> To: [email protected]
> Message-ID:
>         <
> cadxxmydi98xs8kz6w8c0oevzygb9_faxvb02bl9+-wto9zz...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Sharing a few SMT resources for Indian languages.
>
> Center For Indian Language Technology <http://www.cfilt.iitb.ac.in>, IIT
> Bombay has hosted Shata-Anuvaadak (100 Translators), a Statistical Machine
> Translation system for Indian languages. It currently supports translation
> between 11 Indian languages:
>
>
>    -     Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi,
>    Marathi, Konkani
>    -     Dravidian languages: Tamil, Telugu, Malayalam
>    -     English
>
>
> It is a Phrase-Based MT system with pre-processing and post-processing
> extensions. The pre-processing includes source-side reordering for English
> to Indian language translation. The post-processing includes
> transliteration between Indian languages for OOV words. The system can be
> accessed at:
>
>         http://www.cfilt.iitb.ac.in/indic-translator
>
> For more details, see the following publication:
>
> Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak
> Bhattacharyya. 2014. *Shata-Anuvadak: Tackling Multiway Translation of
> Indian Languages*. Language Resources and Evaluation Conference *(LREC
> 2014)*.
>
> We are also making available software and resources developed in the Center
> for the system and for ongoing research. These are available under an open
> source license for research use. These include:
>
> *Software*
>
>    - Indian language NLP tools: common NLP tools for Indian languages that
>    are useful for machine translation, including Unicode normalizers,
>    tokenizers, morphological analysers and transliteration systems.
>    - Source-side reordering system for SMT
>    - A simple experiment management system for Moses
>
> *Resources*
>
>    - Translation models for phrase-based SMT systems for all language pairs
>    in Shata-Anuvaadak
>    - Language models for all languages in Shata-Anuvaadak
>    - Transliteration models for some language pairs (Moses-based)
>
> You can access these resources at:
>
>     http://www.cfilt.iitb.ac.in/static/download.html
>
> Regards,
> Anoop.
>
> http://www.cse.iitb.ac.in/~anoopk
>
> ------------------------------
>
> Message: 2
> Date: Tue, 25 Nov 2014 09:10:06 +0000
> From: Hieu Hoang <[email protected]>
> Subject: Re: [Moses-support] (no subject)
> To: Daramola Olaife <[email protected]>, [email protected],
>         [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="windows-1252"
>
> I'm getting a different error when compiling irstlm5.80.06 with the
> latest moses from github.
>     moses/LM/IRST.cpp:60:21: error: invalid use of incomplete type
> ‘class lmContainer’
>         if (m_lmtb) m_lmtb->reset_mmap();
>
> Using irstlm5.80.03 works fine
>     http://sourceforge.net/projects/irstlm/files/irstlm/irstlm-5.80/
>
>
> On 24/11/14 12:50, Daramola Olaife wrote:
> > After installing irstlm, I tried linking it to moses with
> > ./bjam --with-irstlm=/home/olaife/irstlm-5.80.06 -j8
> > but it gave me an error.
> >
> >
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 25 Nov 2014 10:12:27 +0100
> From: Felipe Sánchez Martínez  <[email protected]>
> Subject: [Moses-support] CFP EAMT 2015: 18th Annual Conference of the
>         European Association for Machine Translation
> To: [email protected], moses-support <[email protected]>,
>         [email protected], [email protected]
> Cc: "awa >> Andy Way" <[email protected]>,  "Mikel L. Forcada"
>         <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
>
> Apologies for cross-posting.
> -----------------------------------------------------------
>
> *18th Annual Conference of the European Association for Machine
> Translation (EAMT 2015; Antalya, Turkey)*
>
> The European Association for Machine Translation
> (EAMT, http://www.eamt.org) invites everyone interested in machine
> translation, translation-related tools and resources to participate in
> this conference - developers, researchers, users, professional
> translators and translation/localisation managers: anyone who has a
> stake in the vision of an information world in which language barriers
> and issues become less visible to the information consumer. We
> especially invite researchers to describe the state of the art and
> demonstrate their cutting-edge results, and professional MT users to
> share their experiences.
>
> EAMT 2015, the 18th Annual Conference of the European Association for
> Machine Translation, will be held in Antalya, Turkey from 11 to 13 May
> 2015.
>
> We expect to receive manuscripts in these three categories:
>
> ------------------------------------
> Research papers
> ------------------------------------
> Long-paper submissions (8 pages) are invited for reports of significant
> research results in any aspect of machine translation and related areas.
> Such reports should include a substantial evaluation component, or have
> a strong theoretical and/or methodological contribution where results
> and in-depth evaluations may not be appropriate. Papers are welcome on
> all topics in the area of Machine Translation or translation-related
> technologies, including:
>
> * Speech translation: speech to text, speech to speech
> * Translation aids (translation memory, terminology databases, etc.)
> * Translation environments (workflow, support tools, conversion tools
> for lexica, etc.)
> * Practical MT systems (MT for professionals, MT for multilingual
> eCommerce, MT for localization, etc.)
> * MT in multilingual public service (eGovernment etc.)
> * MT for the web
> * MT embedded in other services
> * MT evaluation techniques and evaluation results
> * Dictionaries and lexica for MT
> * Text and speech corpora for MT
> * Standards in text and lexicon encoding for MT
> * Human factors in MT and user interfaces
> * Related multilingual technologies (natural language generation,
> information retrieval, text categorization, text summarization,
> information extraction, etc.)
>
> Papers should describe original work. They should emphasize completed
> work rather than intended work, and should indicate clearly the state of
> completion of the reported results. Where appropriate, concrete
> evaluation results should be included.
>
> ------------------------------------
> User studies
> ------------------------------------
> Short-paper submissions (2-4 pages) are invited for reports on users'
> experiences with MT, be it in small or medium size business (SMB),
> enterprise, government, or NGOs. Contributions are welcome on:
>
> * Integrating MT and computer-assisted translation into a translation
> production workflow (e.g. transforming terminology glossaries into MT
> resources, optimizing TM/MT thresholds, mixing online and offline tools,
> using interactive MT, dealing with MT confidence scores);
> * Use of MT to improve translation or localization workflows (e.g.
> reducing turnaround times, improving translation consistency, increasing
> the scope of globalization projects);
> * Managing change when implementing and using MT (e.g. switching between
> multiple MT systems, limiting degradations when updating or upgrading an
> MT system);
> * Implementing open-source MT in the SMB or enterprise (e.g. strategies
> to get support, reports on taking pilot results into full deployment,
> examples of advanced customisation sought and obtained thanks to the
> open-source paradigm, collaboration within open-source MT projects);
> * Evaluation of MT in a real-world setting (e.g. error detection
> strategies employed, metrics used, productivity or translation quality
> gains achieved);
> * Post-editing strategies and tools (e.g. limitations of traditional
> translation quality assurance tools, challenges associated with
> post-editing guidelines);
> * Legal issues associated with MT, especially MT in the cloud (e.g.
> copyright, privacy);
> * Use of MT in social networking or real-time communication (e.g.
> enterprise support chat, multilingual content for social media);
> * Use of MT to process multilingual content for assimilation purposes
> (e.g. cross-lingual information retrieval, MT for e-discovery or spam
> detection, MT for highly dynamic content);
> * Use of standards for MT.
>
> Papers should highlight problems and solutions and not merely describe
> MT integration process or project settings. Where solutions do not seem
> to exist, suggestions for MT researchers and developers should be
> clearly emphasized. For user papers produced by academics, we require
> co-authorship with the actual users.
>
> ------------------------------------
> Project/Product description
> ------------------------------------
> Abstract submissions (1 page) are invited to report new, interesting:
>
> * Tools for machine translation, computer aided translation, and the
> like (including commercial products and open-source software). The
> authors should be ready to present the tools in the form of demos or
> posters during the conference.
> * Research projects related to machine translation. The authors should
> be ready to present the projects in the form of posters during the
> conference. This follows on from the successful "project villages" held
> at the last two EAMT conferences.
>
> ------------------------------------
> Programme
> ------------------------------------
> The programme will include oral presentations and poster sessions.
> Accepted papers may be assigned to an oral or poster session, but no
> differentiation will be made in the conference proceedings.
>
> ------------------------------------
> Important Dates
> ------------------------------------
> * Paper submission: February 5, 2015
> * Notification to authors: March 12, 2015
> * Camera-ready deadline: April 2, 2015
> * Conference: May 11-13, 2015
>
> ------------------------------------
> Conference website
> ------------------------------------
> http://www.eamt2015.org/
>
> For further information about this call for papers please contact the
> track chairs at [email protected] and put in the title "[user]" or
> "[research]" depending on which track your question is related to. For
> questions about the organisation (venue, registration, accommodation,
> etc.) please contact the local organisers at [email protected].
>
> Kind regards
> --
> Gema Ramírez-Sánchez, Fred Hollowood and Felipe Sánchez-Martínez
> on behalf of the EAMT 2015 Organising Committee
>
>
> ------------------------------
>
> Message: 4
> Date: Tue, 25 Nov 2014 12:02:32 +0100
> From: Hoang Cuong <[email protected]>
> Subject: Re: [Moses-support] Too large language models - how to handle
>         that?
> To: Marcin Junczys-Dowmunt <[email protected]>
> Cc: [email protected]
> Message-ID:
>         <CAG1fz7d=
> [email protected]>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Raj, Tom and Marcin,
> I binarized the ARPA file last night, following your suggestion. In the
> end, it produced a binarized LM file of roughly *100GB* (@Marcin - this is
> not the 20-30GB you suggested; is this size okay?)
> Fortunately, the infrastructure at my university allows me to run
> experiments with that.
> Thanks a lot for your help.
> It is so great to play with such huge LMs :))
> Best,
>
>
> On Mon, Nov 24, 2014 at 3:19 PM, Marcin Junczys-Dowmunt <
> [email protected]>
> wrote:
>
> >  The command
> >
> > moses/bin/build_binary trie -a 22 -b 8 -q 8 lm.arpa lm.kenlm
> >
> > will build a compressed binarized model with quantization. You can run
> >
> > moses/bin/build_binary lm.arpa
> >
> > without any parameters to get size estimates for different parameter
> > settings. I would guess you will get a binarized LM of roughly 20 to 30 GB,
> > which is manageable (provided the size you gave us is that of an
> > uncompressed text file). You can also use lmplz to build pruned models in
> > the first place, these will be much smaller.
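
For anyone scripting this step, here is a minimal Python sketch that assembles the two build_binary invocations quoted above. The function names, parameter names and the default binary path are illustrative assumptions, not part of Moses or KenLM; the flag values (-a 22 -b 8 -q 8), the file names and the argument order are copied from the message.

```python
import subprocess


def size_estimate_cmd(arpa_path, binary="moses/bin/build_binary"):
    """Command with no output file: build_binary only prints size estimates."""
    return [binary, arpa_path]


def quantized_trie_cmd(arpa_path, out_path, pointer_bits=22, backoff_bits=8,
                       prob_bits=8, binary="moses/bin/build_binary"):
    """Command for a trie model with pointer compression (-a) and
    quantized backoffs (-b) and probabilities (-q)."""
    return [binary, "trie",
            "-a", str(pointer_bits),
            "-b", str(backoff_bits),
            "-q", str(prob_bits),
            arpa_path, out_path]


def run(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    print(" ".join(size_estimate_cmd("lm.arpa")))
    print(" ".join(quantized_trie_cmd("lm.arpa", "lm.kenlm")))
```

Calling run(size_estimate_cmd("lm.arpa")) first and reading the printed estimates before committing to a long binarization run is probably the cheaper order of operations.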
> >
> > On 2014-11-24 15:11, Tom Hoar wrote:
> >
> > After binarizing such a large ARPA file with KenLM, you'll need to
> > configure your moses.ini file to "lazily load the model using mmap." This
> > involves using lmodel-file code "9" vs code "8." More details here:
> > https://kheafield.com/code/kenlm/moses/
> >
> > Performance improves significantly if you store the binarized file on an
> > SSD.
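
The switch from code "8" to code "9" described above can be sketched as a small Python helper. This assumes the older moses.ini layout, in which each line under [lmodel-file] reads "code factor order path"; the helper name and the example paths are hypothetical, and newer feature-function style configurations are laid out differently.

```python
def use_lazy_kenlm(ini_text):
    """Return ini_text with KenLM entries (code 8) under [lmodel-file]
    switched to lazily loaded KenLM (code 9). Other lines are unchanged."""
    out = []
    in_lm_section = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new section header: remember whether it is the LM section.
            in_lm_section = (stripped == "[lmodel-file]")
        elif in_lm_section and stripped:
            fields = stripped.split()
            if fields[0] == "8":
                fields[0] = "9"
                line = " ".join(fields)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    example = (
        "[lmodel-file]\n"
        "8 0 5 /path/to/lm.kenlm\n"
        "\n"
        "[ttable-file]\n"
        "0 0 0 5 /path/to/phrase-table\n"
    )
    print(use_lazy_kenlm(example))
```

Only lines inside the [lmodel-file] section are touched, so a leading "8" elsewhere in the file is left alone.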
> >
> >
> >
> >
> > On 11/24/2014 07:00 PM, Raj Dabre wrote:
> >
> >   Hey Hoang,
> > You should binarize the arpa file.
> > The readme of the LM tool (KenLM or IRSTLM or SRILM) will tell you how.
> > Regards.
> >
> > On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong <[email protected]>
> > wrote:
> >
> >> Hi all,
> >> I have trained an (unpruned) 5-gram language model on a large corpus of
> >> 5 billion words, resulting in an ARPA-format file of roughly 300GB (is that
> >> a normal LM size for this much monolingual data?). This is obviously too
> >> big for running an SMT system.
> >> I have read several papers whose systems use language models trained on
> >> monolingual corpora of similar size. Could you give me some advice on how
> >> to handle this and make it feasible to run SMT systems?
> >> I appreciate your help a lot,
> >> Best,
> >>  --
> >>  Best Regards,
> >>  Hoang Cuong
> >>  SMTNerd
> >>
> >> _______________________________________________
> >> Moses-support mailing list
> >> [email protected]
> >> http://mailman.mit.edu/mailman/listinfo/moses-support
> >>
> >>
> >
> >
> > --
> >  Raj Dabre.
> > Research Student,
> > Graduate School of Informatics,
> > Kyoto University.
> > CSE MTech, IITB., 2011-2014
> >
> >
> >
> >
>
>
> --
>
> *Best Regards, Hoang Cuong, SMTNerd*
>
> ------------------------------
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> End of Moses-support Digest, Vol 97, Issue 77
> *********************************************
>



-- 
Regards:
राज नाथ पटेल/Raj Nath Patel
http://kbcs.in/

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
