On Mon, Apr 22, 2013 at 7:03 PM, piyush arora <[email protected]> wrote:
> I looked into two main aspects: 1) learning different models using Moses, and
> 2) data sources for the English-Bengali language pair.
>
> Models of interest:
> 1) Phrase-based approach: relies on parallel corpora and a monolingual corpus
> for building the language model. Dictionaries can also be incorporated to
> enrich the model and reduce missing or erroneous translations.
> 2) Factored approach (http://www.statmt.org/moses/?n=Moses.FactoredModels):
> relies on parallel data annotated with additional information such as part
> of speech, morphology and other features.
>
> I read some papers, and it seems the factored approach performs better than
> the phrase-based approach. It can be interpreted as a phrase-based model
> that incorporates linguistic information and cues.
>
> Data sources:
> The main 2-3 data sources that have been used are as follows:
> 1) EMILLE corpus (http://www.elda.org/catalogue/en/text/W0037.html); it has
> about 18k Bengali-English sentences

The EMILLE license is somewhat restrictive. I am not a lawyer, but training
a system on EMILLE data may limit how the resulting application can be used.

> 2) Joshua Corpus
>
> http://joshua-decoder.org

Have you tried reaching out to ISI Kolkata to ask whether they have any data
sources?

> 3) Learning information from the Wikipedia dump, which has about 25k articles
>
> >> Golam did a stellar job with Anubadok. However, it does have a current
> >> limitation in being unable to be turned on to a data source of scale
> >> in order to have inorganic generation of content in Bengali
>
> Can you provide more information about the same? Is it more of a rule-based /
> syntax model?

There was an effort to turn Anubadok into an Apertium project (probably
during an earlier iteration of GSoC). You should get in touch with Golam to
get more detail.

/sankarshan
