+1 (binding). Should be a very interesting project!
On Sat, Jan 30, 2016 at 8:05 PM, Ashish <paliwalash...@gmail.com> wrote:
> +1 (non-binding)
>
> On Sat, Jan 30, 2016 at 12:00 PM, Mattmann, Chris A (3980)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> > Hi Everyone,
> >
> > OK, the discussion is now complete. Please VOTE to accept Joshua
> > into the Apache Incubator. I’ll leave the VOTE open for at least
> > the next 72 hours, with hopes to close it next Friday the 5th of
> > February, 2016.
> >
> > [ ] +1 Accept Joshua as an Apache Incubator podling.
> > [ ] +0 Abstain.
> > [ ] -1 Don’t accept Joshua as an Apache Incubator podling because..
> >
> > Of course, I am +1 on this. Please note VOTEs from Incubator PMC
> > members are binding but all are welcome to VOTE!
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: jpluser <chris.a.mattm...@jpl.nasa.gov>
> > Date: Tuesday, January 12, 2016 at 10:56 PM
> > To: "general@incubator.apache.org" <general@incubator.apache.org>
> > Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu>
> > Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation
> > Toolkit
> >
> >>Hi Everyone,
> >>
> >>Please find attached for your viewing pleasure a proposed new project,
> >>Apache Joshua, a statistical machine translation toolkit.
> >>The proposal is in wiki draft form at:
> >>https://wiki.apache.org/incubator/JoshuaProposal
> >>
> >>Proposal text is copied below. I’ll leave the discussion open for a week
> >>and we are interested in folks who would like to be initial committers
> >>and mentors. Please discuss here on the thread.
> >>
> >>Thanks!
> >>
> >>Cheers,
> >>Chris (Champion)
> >>
> >>---
> >>
> >>= Joshua Proposal =
> >>
> >>== Abstract ==
> >>[[joshua-decoder.org|Joshua]] is an open-source statistical machine
> >>translation toolkit. It includes a Java-based decoder for translating with
> >>phrase-based, hierarchical, and syntax-based translation models, a
> >>Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
> >>scripts for training and evaluating new models from parallel text.
> >>
> >>== Proposal ==
> >>Joshua is a state-of-the-art statistical machine translation system that
> >>provides a number of features:
> >>
> >> * Support for the two main paradigms in statistical machine translation:
> >>phrase-based and hierarchical / syntactic.
> >> * A sparse feature API that makes it easy to add new feature templates
> >>supporting millions of features
> >> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
> >> * Support for lattice decoding, allowing upstream NLP tools to expose
> >>their hypothesis space to the MT system
> >> * An efficient representation for models, allowing for quick loading of
> >>multi-gigabyte model files
> >> * Fast decoding speed (on par with Moses and mtplz)
> >> * Language packs: precompiled models that allow the decoder to be run as
> >>a black box
> >> * Thrax, a Hadoop-based tool for learning translation models from
> >>parallel text
> >> * A suite of tools for constructing new models for any language pair for
> >>which sufficient training data exists
> >>
> >>== Background and Rationale ==
> >>A number of factors make this a good time for an Apache project focused on
> >>machine translation (MT): the quality of MT output (for many language
> >>pairs); the average computing resources available on computers, relative
> >>to the needs of MT systems; and the availability of a number of
> >>high-quality toolkits, together with a large base of researchers working
> >>on them.
> >>
> >>Over the past decade, machine translation (MT; the automatic translation
> >>of one human language to another) has become a reality. The research into
> >>statistical approaches to translation that began in the early nineties,
> >>together with the availability of large amounts of training data and
> >>better computing infrastructure, have all come together to produce
> >>translation results that are “good enough” for a large set of language
> >>pairs and use cases.
> >>Free services like
> >>[[https://www.bing.com/translator|Bing Translator]] and
> >>[[https://translate.google.com|Google Translate]] have made these services
> >>available to the average person through direct interfaces and through
> >>tools like browser plugins, and sites across the world with higher
> >>translation needs use them to translate their pages automatically.
> >>
> >>MT does not require the infrastructure of large corporations in order to
> >>produce feasible output. Machine translation can be resource-intensive,
> >>but need not be prohibitively so. Disk and memory usage are mostly a
> >>matter of model size, which for most language pairs is a few gigabytes at
> >>most, at which size models can provide coverage on the order of tens or
> >>even hundreds of thousands of words in the input and output languages. The
> >>computational complexity of the algorithms used to search for translations
> >>of new sentences is typically linear in the number of words in the input
> >>sentence, making it possible to run a translation engine on a personal
> >>computer.
> >>
> >>The research community has produced many different open source translation
> >>projects for a range of programming languages and under a variety of
> >>licenses. These projects include the core “decoder”, which takes a model
> >>and uses it to translate new sentences between the language pair the model
> >>was defined for. They also typically include a large set of tools that
> >>enable new models to be built from large sets of example translations
> >>(“parallel data”) and monolingual texts. These toolkits are usually built
> >>to support the agendas of the (largely) academic researchers that build
> >>them: the repeated cycle of building new models, tuning model parameters
> >>against development data, and evaluating them against held-out test data,
> >>using standard metrics for testing the quality of MT output.
> >>
> >>Together, these three factors (the quality of machine translation output,
> >>the feasibility of translating on standard computers, and the availability
> >>of tools to build models) make it reasonable for end users to use MT as
> >>a black-box service, and to run it on their personal machines.
> >>
> >>These factors make it a good time for an organization with the status of
> >>the Apache Foundation to host a machine translation project.
> >>
> >>== Current Status ==
> >>Joshua was originally ported from David Chiang’s Python implementation of
> >>Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
> >>University. The current version is maintained by Matt Post at Johns
> >>Hopkins’ Human Language Technology Center of Excellence. Joshua has made
> >>many releases with a list of over 20 source code tags. The last release of
> >>Joshua was 6.0.5 on November 5th, 2015.
> >>
> >>== Meritocracy ==
> >>The current developers are familiar with meritocratic open source
> >>development at Apache. Apache was chosen specifically because we want to
> >>encourage this style of development for the project.
> >>
> >>== Community ==
> >>Joshua is used widely across the world. Perhaps its biggest (known)
> >>research / industrial user is the Amazon research group in Berlin. Another
> >>user is the US Army Research Lab. No formal census has been undertaken,
> >>but posts to the Joshua technical support mailing list, along with the
> >>occasional contributions, suggest small research and academic communities
> >>spread across the world, many of them in India.
> >>
> >>During incubation, we will explicitly seek to increase our usage across
> >>the board, including academic research, industry, and other end users
> >>interested in statistical machine translation.
> >>
> >>== Core Developers ==
> >>The current set of core developers is fairly small, having fallen with the
> >>graduation from Johns Hopkins of some core student participants.
> >>However,
> >>Joshua is used fairly widely, as mentioned above, and there remains a
> >>commitment from the principal researcher at Johns Hopkins to continue to
> >>use and develop it. Joshua has seen a number of new community members
> >>become interested recently due to its projected use in a number of
> >>ongoing DARPA projects such as XDATA and Memex.
> >>
> >>== Alignment ==
> >>Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
> >>rights reserved, and licensed under the BSD 2-clause license. It would of
> >>course be the intention to relicense this code under AL2.0 which would
> >>permit expanded and increased use of the software within Apache projects.
> >>There is currently an ongoing effort within the Apache Tika community to
> >>utilize Joshua within Tika’s Translate API, see
> >>[[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
> >>
> >>== Known Risks ==
> >>
> >>=== Orphaned products ===
> >>At the moment, regular contributions are made by a single contributor, the
> >>lead maintainer. He (Matt Post) plans to continue development for the next
> >>few years, but this is still a single point of failure, since the graduate
> >>students who worked on the project have moved on to jobs, mostly in
> >>industry. However, our goal is to address that risk by growing the
> >>community at Apache, at the least with users and participants from
> >>NASA JPL.
> >>
> >>=== Inexperience with Open Source ===
> >>The teams at both Johns Hopkins and NASA JPL have experience with many OSS
> >>projects at Apache and elsewhere. We understand "how it works"
> >>here at the foundation.
> >>
> >>== Relationships with Other Apache Products ==
> >>Joshua depends on Hadoop and is also included as a plugin in
> >>Apache Tika. We are also interested in coordinating with other projects
> >>including Spark, and other projects needing MT services for language
> >>translation.
> >>
> >>== Developers ==
> >>Joshua only has one regular developer who is employed by Johns Hopkins
> >>University. NASA JPL (Mattmann and McGibbney) have been contributing
> >>lately, including a Brew formula and other contributions to the project,
> >>through the DARPA XDATA and Memex programs.
> >>
> >>== Documentation ==
> >>Documentation and publications related to Joshua can be found at
> >>joshua-decoder.org. The source for the Joshua documentation is currently
> >>hosted on Github at
> >>https://github.com/joshua-decoder/joshua-decoder.github.com
> >>
> >>== Initial Source ==
> >>Current source resides at Github: github.com/joshua-decoder/joshua (the
> >>main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
> >>extraction tool).
> >>
> >>== External Dependencies ==
> >>Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
> >>and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
> >>needed for translating sentences with pre-built models). The rest are
> >>dependencies for the build system and pipeline, used for constructing and
> >>training new models from parallel text.
> >>
> >>Apache projects:
> >> * Ant
> >> * Hadoop
> >> * Commons
> >> * Maven
> >> * Ivy
> >>
> >>There are also a number of other open-source projects with various
> >>licenses that the project depends on both dynamically (runtime), and
> >>statically.
> >>
> >>=== GNU GPL 2 ===
> >> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
> >>
> >>=== LGPL 2.1 ===
> >> * KenLM: github.com/kpu/kenlm
> >>
> >>=== Apache 2.0 ===
> >> * BerkeleyLM: https://code.google.com/p/berkeleylm/
> >>
> >>=== GNU GPL ===
> >> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
> >>
> >>== Required Resources ==
> >> * Mailing Lists
> >>    * priv...@joshua.incubator.apache.org
> >>    * d...@joshua.incubator.apache.org
> >>    * comm...@joshua.incubator.apache.org
> >>
> >> * Git Repos
> >>    * https://git-wip-us.apache.org/repos/asf/joshua.git
> >>
> >> * Issue Tracking
> >>    * JIRA Joshua (JOSHUA)
> >>
> >> * Continuous Integration
> >>    * Jenkins builds on https://builds.apache.org/
> >>
> >> * Web
> >>    * http://joshua.incubator.apache.org/
> >>    * wiki at http://cwiki.apache.org
> >>
> >>== Initial Committers ==
> >>The following is a list of the planned initial Apache committers (the
> >>active subset of the committers for the current repository on Github).
> >>
> >> * Matt Post (p...@cs.jhu.edu)
> >> * Lewis John McGibbney (lewi...@apache.org)
> >> * Chris Mattmann (mattm...@apache.org)
> >>
> >>== Affiliations ==
> >>
> >> * Johns Hopkins University
> >>    * Matt Post
> >>
> >> * NASA JPL
> >>    * Chris Mattmann
> >>    * Lewis John McGibbney
> >>
> >>== Sponsors ==
> >>=== Champion ===
> >> * Chris Mattmann (NASA/JPL)
> >>
> >>=== Nominated Mentors ===
> >> * Paul Ramirez
> >> * Lewis John McGibbney
> >> * Chris Mattmann
> >>
> >>== Sponsoring Entity ==
> >>The Apache Incubator
>
> --
> thanks
> ashish