Awesome, glad to have you on board! Please add your info to the proposal on the wiki.

Sent from my iPhone

> On Jan 19, 2016, at 9:02 AM, ben gao <[email protected]> wrote:
>
> Yes, I am.
>
> On Tue, Jan 19, 2016, 11:56 AM Mattmann, Chris A (3980) <[email protected]> wrote:
>
>> Thanks Martin, filed:
>>
>> https://github.com/joshua-decoder/joshua/issues/239
>>
>> Are you interested in joining in the Incubation efforts?
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Martin Gainty <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 19, 2016 at 7:57 AM
>> To: "[email protected]" <[email protected]>, Lewis McGibbney <[email protected]>
>> Subject: RE: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>
>>> dependency addition to Joshua-Decoder/Joshua/pom.xml
>>>
>>> <!-- MCG added args4j for
>>>      /Joshua-Decoder/Joshua/src/joshua/decoder/ff/tm/CreateGlueGrammar.java:[17,26]
>>>      package org.kohsuke.args4j does not exist error -->
>>> <dependency>
>>>   <groupId>args4j</groupId>
>>>   <artifactId>args4j</artifactId>
>>>   <version>2.32</version>
>>> </dependency>
>>>
>>> Joshua-Decoder/Joshua committer please add this dependency to pom.xml
>>> thank you/
>>> Martin
>>> ______________________________________________
>>>
>>>> From: [email protected]
>>>> To: [email protected]
>>>> CC: [email protected]; [email protected]; [email protected]; [email protected]
>>>> Subject: Re: [DISCUSS]
>>>> Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>> Date: Tue, 19 Jan 2016 05:58:26 +0000
>>>>
>>>> Great, Hen, we’d love to have you on board as a mentor! Please
>>>> add yourself to the proposal on the wiki.
>>>>
>>>> Anyone else have interest in Machine Translation? Any OpenNLP folks,
>>>> Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility;
>>>> please feel free to reply to [email protected].
>>>>
>>>> I’ll leave the DISCUSS thread open for a few more days.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Henri Yandell <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Monday, January 18, 2016 at 7:57 PM
>>>> To: jpluser <[email protected]>, "[email protected]" <[email protected]>
>>>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>>
>>>>> Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>>>>> mentoring.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: jpluser <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Tuesday, January 12, 2016 at 10:56 PM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> Please find attached for your viewing pleasure a proposed new project,
>>>>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>>>>> is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
>>>>>>>
>>>>>>> Proposal text is copied below. I’ll leave the discussion open for a week
>>>>>>> and we are interested in folks who would like to be initial committers
>>>>>>> and mentors. Please discuss here on the thread.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris (Champion)
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> = Joshua Proposal =
>>>>>>>
>>>>>>> == Abstract ==
>>>>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>>>>> translation toolkit. It includes a Java-based decoder for translating with
>>>>>>> phrase-based, hierarchical, and syntax-based translation models, a
>>>>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>>>>>>> scripts for training and evaluating new models from parallel text.
>>>>>>>
>>>>>>> == Proposal ==
>>>>>>> Joshua is a state of the art statistical machine translation system that
>>>>>>> provides a number of features:
>>>>>>>
>>>>>>> * Support for the two main paradigms in statistical machine translation:
>>>>>>>   phrase-based and hierarchical / syntactic.
>>>>>>> * A sparse feature API that makes it easy to add new feature templates
>>>>>>>   supporting millions of features
>>>>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>>>>> * Support for lattice decoding, allowing upstream NLP tools to expose
>>>>>>>   their hypothesis space to the MT system
>>>>>>> * An efficient representation for models, allowing for quick loading of
>>>>>>>   multi-gigabyte model files
>>>>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>>>>> * Language packs: precompiled models that allow the decoder to be run as
>>>>>>>   a black box
>>>>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>>>>   parallel text
>>>>>>> * A suite of tools for constructing new models for any language pair for
>>>>>>>   which sufficient training data exists
>>>>>>>
>>>>>>> == Background and Rationale ==
>>>>>>> A number of factors make this a good time for an Apache project focused on
>>>>>>> machine translation (MT): the quality of MT output (for many language
>>>>>>> pairs); the average computing resources available on computers, relative
>>>>>>> to the needs of MT systems; and the availability of a number of
>>>>>>> high-quality toolkits, together with a large base of researchers working
>>>>>>> on them.
>>>>>>>
>>>>>>> Over the past decade, machine translation (MT; the automatic translation
>>>>>>> of one human language to another) has become a reality. The research into
>>>>>>> statistical approaches to translation that began in the early nineties,
>>>>>>> together with the availability of large amounts of training data, and
>>>>>>> better computing infrastructure, have all come together to produce
>>>>>>> translation results that are “good enough” for a large set of language
>>>>>>> pairs and use cases.
>>>>>>> Free services like
>>>>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>>>>> [[https://translate.google.com|Google Translate]] have made these services
>>>>>>> available to the average person through direct interfaces and through
>>>>>>> tools like browser plugins, and sites across the world with higher
>>>>>>> translation needs use them to translate their pages automatically.
>>>>>>>
>>>>>>> MT does not require the infrastructure of large corporations in order to
>>>>>>> produce feasible output. Machine translation can be resource-intensive,
>>>>>>> but need not be prohibitively so. Disk and memory usage are mostly a
>>>>>>> matter of model size, which for most language pairs is a few gigabytes at
>>>>>>> most, at which size models can provide coverage on the order of tens or
>>>>>>> even hundreds of thousands of words in the input and output languages. The
>>>>>>> computational complexity of the algorithms used to search for translations
>>>>>>> of new sentences is typically linear in the number of words in the input
>>>>>>> sentence, making it possible to run a translation engine on a personal
>>>>>>> computer.
>>>>>>>
>>>>>>> The research community has produced many different open source translation
>>>>>>> projects for a range of programming languages and under a variety of
>>>>>>> licenses. These projects include the core “decoder”, which takes a model
>>>>>>> and uses it to translate new sentences between the language pair the model
>>>>>>> was defined for. They also typically include a large set of tools that
>>>>>>> enable new models to be built from large sets of example translations
>>>>>>> (“parallel data”) and monolingual texts.
>>>>>>> These toolkits are usually built
>>>>>>> to support the agendas of the (largely) academic researchers that build
>>>>>>> them: the repeated cycle of building new models, tuning model parameters
>>>>>>> against development data, and evaluating them against held-out test data,
>>>>>>> using standard metrics for testing the quality of MT output.
>>>>>>>
>>>>>>> Together, these three factors (the quality of machine translation output,
>>>>>>> the feasibility of translating on standard computers, and the availability
>>>>>>> of tools to build models) make it reasonable for end users to use MT as
>>>>>>> a black-box service, and to run it on their personal machines.
>>>>>>>
>>>>>>> These factors make it a good time for an organization with the status of
>>>>>>> the Apache Foundation to host a machine translation project.
>>>>>>>
>>>>>>> == Current Status ==
>>>>>>> Joshua was originally ported from David Chiang’s Python implementation of
>>>>>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>>>>> University. The current version is maintained by Matt Post at Johns
>>>>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has had
>>>>>>> many releases, with over 20 source code tags. The last release of
>>>>>>> Joshua was 6.0.5 on November 5th, 2015.
>>>>>>>
>>>>>>> == Meritocracy ==
>>>>>>> The current developers are familiar with meritocratic open source
>>>>>>> development at Apache. Apache was chosen specifically because we want to
>>>>>>> encourage this style of development for the project.
>>>>>>>
>>>>>>> == Community ==
>>>>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>>>>> research / industrial user is the Amazon research group in Berlin. Another
>>>>>>> user is the US Army Research Lab.
>>>>>>> No formal census has been undertaken,
>>>>>>> but posts to the Joshua technical support mailing list, along with the
>>>>>>> occasional contributions, suggest small research and academic communities
>>>>>>> spread across the world, many of them in India.
>>>>>>>
>>>>>>> During incubation, we will explicitly seek to increase our usage across
>>>>>>> the board, including academic research, industry, and other end users
>>>>>>> interested in statistical machine translation.
>>>>>>>
>>>>>>> == Core Developers ==
>>>>>>> The current set of core developers is fairly small, having fallen with the
>>>>>>> graduation from Johns Hopkins of some core student participants. However,
>>>>>>> Joshua is used fairly widely, as mentioned above, and there remains a
>>>>>>> commitment from the principal researcher at Johns Hopkins to continue to
>>>>>>> use and develop it. Joshua has seen a number of new community members
>>>>>>> become interested recently due to a potential for its projected use in a
>>>>>>> number of ongoing DARPA projects such as XDATA and Memex.
>>>>>>>
>>>>>>> == Alignment ==
>>>>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>>>>> rights reserved, and licensed under the BSD 2-clause license. It would of
>>>>>>> course be the intention to relicense this code under AL2.0, which would
>>>>>>> permit expanded and increased use of the software within Apache projects.
>>>>>>> There is currently an ongoing effort within the Apache Tika community to
>>>>>>> utilize Joshua within Tika’s Translate API; see
>>>>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>>>>
>>>>>>> == Known Risks ==
>>>>>>>
>>>>>>> === Orphaned products ===
>>>>>>> At the moment, regular contributions are made by a single contributor, the
>>>>>>> lead maintainer.
>>>>>>> He (Matt Post) plans to continue development for the next
>>>>>>> few years, but it is still a single point of failure, since the graduate
>>>>>>> students who worked on the project have moved on to jobs, mostly in
>>>>>>> industry. However, our goal is to mitigate this risk by growing the
>>>>>>> community at Apache, and at the least by adding users and
>>>>>>> participants from NASA JPL.
>>>>>>>
>>>>>>> === Inexperience with Open Source ===
>>>>>>> The teams at both Johns Hopkins and NASA JPL have experience with many OSS
>>>>>>> software projects at Apache and elsewhere. We understand "how it works"
>>>>>>> here at the foundation.
>>>>>>>
>>>>>>> == Relationships with Other Apache Products ==
>>>>>>> Joshua depends on Hadoop and is also included as a plugin in
>>>>>>> Apache Tika. We are also interested in coordinating with other projects
>>>>>>> including Spark, and other projects needing MT services for language
>>>>>>> translation.
>>>>>>>
>>>>>>> == Developers ==
>>>>>>> Joshua only has one regular developer, who is employed by Johns Hopkins
>>>>>>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>>>>>>> lately, including a Brew formula and other contributions to the project
>>>>>>> through the DARPA XDATA and Memex programs.
>>>>>>>
>>>>>>> == Documentation ==
>>>>>>> Documentation and publications related to Joshua can be found at
>>>>>>> joshua-decoder.org. The source for the Joshua documentation is currently
>>>>>>> hosted on GitHub at
>>>>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>>>>
>>>>>>> == Initial Source ==
>>>>>>> Current source resides at GitHub: github.com/joshua-decoder/joshua (the
>>>>>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>>>>>>> extraction tool).
>>>>>>>
>>>>>>> == External Dependencies ==
>>>>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>>>>>>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>>>>>>> needed for translating sentences with pre-built models). The rest are
>>>>>>> dependencies for the build system and pipeline, used for constructing and
>>>>>>> training new models from parallel text.
>>>>>>>
>>>>>>> Apache projects:
>>>>>>> * Ant
>>>>>>> * Hadoop
>>>>>>> * Commons
>>>>>>> * Maven
>>>>>>> * Ivy
>>>>>>>
>>>>>>> There are also a number of other open-source projects with various
>>>>>>> licenses that the project depends on both dynamically (runtime) and
>>>>>>> statically.
>>>>>>>
>>>>>>> === GNU GPL 2 ===
>>>>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>>>>
>>>>>>> === LGPL 2.1 ===
>>>>>>> * KenLM: github.com/kpu/kenlm
>>>>>>>
>>>>>>> === Apache 2.0 ===
>>>>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>>>>
>>>>>>> === GNU GPL ===
>>>>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>>>>
>>>>>>> == Required Resources ==
>>>>>>> * Mailing Lists
>>>>>>>   * [email protected]
>>>>>>>   * [email protected]
>>>>>>>   * [email protected]
>>>>>>>
>>>>>>> * Git Repos
>>>>>>>   * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>>>>
>>>>>>> * Issue Tracking
>>>>>>>   * JIRA Joshua (JOSHUA)
>>>>>>>
>>>>>>> * Continuous Integration
>>>>>>>   * Jenkins builds on https://builds.apache.org/
>>>>>>>
>>>>>>> * Web
>>>>>>>   * http://joshua.incubator.apache.org/
>>>>>>>   * wiki at http://cwiki.apache.org
>>>>>>>
>>>>>>> == Initial Committers ==
>>>>>>> The following is a list of the planned initial Apache committers (the
>>>>>>> active subset of the committers for the current repository on GitHub).
>>>>>>>
>>>>>>> * Matt Post ([email protected])
>>>>>>> * Lewis John McGibbney ([email protected])
>>>>>>> * Chris Mattmann ([email protected])
>>>>>>>
>>>>>>> == Affiliations ==
>>>>>>>
>>>>>>> * Johns Hopkins University
>>>>>>>   * Matt Post
>>>>>>>
>>>>>>> * NASA JPL
>>>>>>>   * Chris Mattmann
>>>>>>>   * Lewis John McGibbney
>>>>>>>
>>>>>>> == Sponsors ==
>>>>>>> === Champion ===
>>>>>>> * Chris Mattmann (NASA/JPL)
>>>>>>>
>>>>>>> === Nominated Mentors ===
>>>>>>> * Paul Ramirez
>>>>>>> * Lewis John McGibbney
>>>>>>> * Chris Mattmann
>>>>>>>
>>>>>>> == Sponsoring Entity ==
>>>>>>> The Apache Incubator
