On Thu, Apr 18, 2013 at 4:26 PM, piyush arora <[email protected]> wrote:
> Hi Sanskarshan,

'Sankarshan'

> Sure not a problem, we can discuss more during the weekend. Sampark is a
> government funded project and the code for the implementation is not
> available as per I now we can look into details for same.

Alright. The immediate issue this raises is that you'd have to bring in a
"clean room" implementation. Ideas and models shouldn't overlap with the
Sampark system, nor should they be strikingly similar.

> We can start by looking how Moses performs and do the error analysis and
> make improvisation over same using the necessary methods . What data are we
> using for learning can you provide more details about the corpus that we
> have in terms of number of sentences.

An aspect of the project idea is that the student would propose
appropriate data sources. The corpus chosen need not be on a giant scale,
but it should be promising enough to be expanded. I envisage the system
being deployed and actively learning (as opposed to a toy project in a
showcase).

> I was also thinking it would be a good idea to first do a ground research
> about other English-Bengali systems and use the knowledge from same.
>
> Two important systems which I found are as follows-
> 1) http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php this is a
> government project and it's more on hybrid mechanism kind of a pipeline
> architecture, we can discuss the details as per the need I know the
> architecture and other detailed information about same.

If the TDIL system is not freely licensed, or not under an appropriate
libre license, we may not want to spend time on it.

> 2)Anubadok- (http://bengalinux.sourceforge.net/cgi-bin/anubadok/index.pl) it
> seems this is an open source project and it's using some of the resources
> been build by Ankur organization the English-Bengali dictionary
> (http://www.bengalinux.org/cgi-bin/abhidhan/statistics.pl) so if you have
> some more details about same then it will be great. I downloaded the
> Anubadok system and is trying to have some hand-on experience on same and
> look into the source code.

Golam did a stellar job with Anubadok. However, its current limitation is
that it cannot be pointed at a data source of scale for inorganic
generation of content in Bengali.

> Apart from this there is also an apertium project
> (http://wiki.apertium.org/wiki/Apertium-bn-en) for English-Bengali language
> pair which has some of the tools and resources available.

Apertium has promise. During the previous year of GSoC we had a proposal
around Apertium and extending/enhancing it.

> I have few queries-
> What are we aiming by this project as far as I see there can be 3 different
> aspects-

We are aiming to create a reasonably robust MT system that we can deploy
and point at a content source of significant volume to obtain translated
content (primarily in Bengali). That content can then be curated, and the
MT system can continue to learn from the curation/editing. In short, a
sentient continuous MT system.

> 1) We want to begin from scratch and use statistical mt and see how it works
> for English-Bengali language pair and over this statistical approach use
> other knowledge to learn rules and make a translation model / prototype.

This works well for a long-haul project, but not for the duration of GSoC.

> 2) Search and based on the available other models and resources such as
> chunker, pos tagger which are openly available make a model combining the
> available resources and build a MT system.

An option that can be investigated.

> 3) Take some of the exiting system and improve over same using statistical
> approaches.

At this stage, this is probably the option we need to assess first, and
quickly.

--
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
