Hi, I have written my idea in the attached file. It is just the idea, not the project proposal. Kindly read it and give feedback on whether this could be a feasible GSoC project.

Best,
Rajarshi
On Fri, 28 Feb 2020 at 06:31, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:

> Here are some published papers on how character embeddings are used for
> classification:
>
> https://arxiv.org/abs/1810.03595
> https://lsm.media.mit.edu/papers/tweet2vec_vvr.pdf
> https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
>
> We have just finished writing a paper on this and have obtained better
> results than the ones in the papers mentioned above. The dataset is
> collected from SentiWordNet, as I mentioned earlier.
> I am not on IRC yet; I will join it.
>
> Best,
> Rajarshi
>
> On Fri, Feb 28, 2020, 01:01 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>
>> How exactly can characters predict sentiment? Don't you still need some
>> training data for pairs? English, Hindi, and Bangla aren't really
>> low-resource languages.
>>
>> Anyway, we can continue this discussion on IRC so that it will be easier
>> and more people can contribute to the discussion.
>>
>> Tanmai
>>
>> On 28-Feb-2020, at 00:52, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>
>> To answer the question of how to analyse sentiment in a low-resource
>> language: I think character embeddings would be the best option. The set
>> of words in a corpus is not exhaustive, but the number of unique
>> characters is small and fixed. We can learn an embedding weight for each
>> character and apply it to a number of NLP tasks, not just sentiment
>> analysis. The drawback of a low-resource language can be slightly
>> mitigated this way.
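[Editor's note: to make the character-embedding idea above concrete, here is a minimal sketch. It is not the method from the paper Rajarshi mentions; the toy word list, the embedding dimension, and the simplification of keeping the character embeddings fixed (rather than training them jointly, as a character-level model normally would) are all assumptions made purely for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled words: 1 = positive, 0 = negative (invented for illustration).
train = [("good", 1), ("great", 1), ("nice", 1),
         ("bad", 0), ("awful", 0), ("poor", 0)]

# The character inventory is small and fixed, unlike the vocabulary --
# this is the point made in the quoted message above.
chars = sorted({c for word, _ in train for c in word})
char_to_ix = {c: i for i, c in enumerate(chars)}

dim = 8
emb = rng.normal(size=(len(chars), dim))  # one embedding vector per character
w = np.zeros(dim)                         # classifier weights
b = 0.0                                   # classifier bias

def featurize(word):
    """Represent a word as the mean of its characters' embeddings."""
    ixs = [char_to_ix[c] for c in word if c in char_to_ix]
    return emb[ixs].mean(axis=0) if ixs else np.zeros(dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic regression on top of the (fixed) character embeddings
# with plain gradient descent. A real system would train the embeddings
# jointly, e.g. with a character-level CNN as in the cited papers.
X = np.array([featurize(word) for word, _ in train])
y = np.array([label for _, label in train], dtype=float)
for _ in range(5000):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

def sentiment(word):
    """Probability that `word` carries positive sentiment."""
    return float(sigmoid(featurize(word) @ w + b))
```

Because the features are per-character, the same trained table can score words that never appeared in training, which is the claimed advantage for low-resource settings; how well that generalises is exactly what a real evaluation would have to show.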
>> On Fri, Feb 28, 2020, 00:46 Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>
>>> As I mentioned earlier, I would like to work on English-Hindi or
>>> English-Bengali translation. The dataset can be obtained from the
>>> SentiWordNet for Indian languages:
>>> https://amitavadas.com/sentiwordnet.php
>>> which is by far the most resourceful dataset available for sentiment
>>> analysis. It contains data for both Hindi and Bengali.
>>>
>>> I cannot give an example specific to Apertium, because whenever I try
>>> to translate a word from English in the interface, the available target
>>> languages are beyond my knowledge. I am not sure, but Hindi/Bengali is
>>> probably not among the languages into which an English word can be
>>> translated. Correct me if I am wrong.
>>>
>>> On Fri, Feb 28, 2020, 00:31 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>>>
>>>> Hi, I have a few questions about this:
>>>> 1. How would you analyse the sentiment of the source text, considering
>>>> that the language pairs Apertium deals with are low-resource languages?
>>>> 2. As Tino mentions, is there actually a problem of sentiment loss in
>>>> Apertium? Any examples of this?
>>>> 3. Doesn't sentiment analysis of a language require a decent amount of
>>>> training data? Where would this data be found for low-resource
>>>> languages?
>>>>
>>>> Tanmai
>>>>
>>>> On Fri, Feb 28, 2020 at 12:02 AM Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>
>>>>> The effect won't be very evident in simple sentences; I think it
>>>>> would be more noticeable in sentences where the choice of words
>>>>> decides the quality of the translation. It is not about whether
>>>>> "Watch out" could become "Be careful"; it is about choosing words
>>>>> that retain the urgency of "watch out". Sentiment information about
>>>>> the original sentence can help with that.
>>>>> On Thu, Feb 27, 2020, 23:47 Scoop Gracie <scoopgra...@gmail.com> wrote:
>>>>>
>>>>>> So, "Watch out!" could become "Be careful"?
>>>>>>
>>>>>> On Thu, Feb 27, 2020, 10:13 Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>>>
>>>>>>> It is not just about minimizing the loss of sentiment; it is about
>>>>>>> using that information for better translation. A trivial example:
>>>>>>> in some situations a sentence projects a strong sentiment, and a
>>>>>>> plain translation may not always yield the best result. However, if
>>>>>>> we can use knowledge of the sentiment to choose the words, it might
>>>>>>> give a better result.
>>>>>>>
>>>>>>> As far as the code is concerned, I need to study the source code,
>>>>>>> or detailed documentation, before proposing a feasible solution.
>>>>>>>
>>>>>>> Best,
>>>>>>> Rajarshi
>>>>>>>
>>>>>>> On Thu, Feb 27, 2020, 23:21 Tino Didriksen <m...@tinodidriksen.com> wrote:
>>>>>>>
>>>>>>>> My first question would be: is this actually a problem for
>>>>>>>> rule-based machine translation? I am not a linguist, but given how
>>>>>>>> RBMT works I can't really see where sentiment would be lost in the
>>>>>>>> process, especially because Apertium is designed for related
>>>>>>>> languages, where sentiment is mostly the same. But even for less
>>>>>>>> related languages, it would come down to the quality of the
>>>>>>>> source-language analysis.
>>>>>>>>
>>>>>>>> Beyond that, please learn how Apertium specifically works, not
>>>>>>>> just RBMT in general. http://wiki.apertium.org/wiki/Documentation
>>>>>>>> is a good start, but our IRC channel is the best place to ask
>>>>>>>> technical questions.
>>>>>>>>
>>>>>>>> One major issue specific to Apertium is that the source
>>>>>>>> information is no longer available in the target generation step.
>>>>>>>> E.g., since you mention English-Hindi, you could install
>>>>>>>> apertium-eng-hin and see how each part of the pipe works. We have
>>>>>>>> precompiled binaries for common platforms. Again, see the wiki and
>>>>>>>> IRC.
>>>>>>>>
>>>>>>>> -- Tino Didriksen
>>>>>>>>
>>>>>>>> On Thu, 27 Feb 2020 at 08:16, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Formally, I present my idea in this form. From my understanding,
>>>>>>>>> an RBMT system contains:
>>>>>>>>>
>>>>>>>>> - a *SL morphological analyser* - analyses a source-language word
>>>>>>>>> and provides its morphological information;
>>>>>>>>> - a *SL parser* - a syntax analyser which analyses
>>>>>>>>> source-language sentences;
>>>>>>>>> - a *translator* - translates a source-language word into the
>>>>>>>>> target language;
>>>>>>>>> - a *TL morphological generator* - generates appropriate
>>>>>>>>> target-language words for the given grammatical information;
>>>>>>>>> - a *TL parser* - composes suitable target-language sentences.
>>>>>>>>>
>>>>>>>>> I propose a sixth component of the RBMT system: a
>>>>>>>>> *sentiment-based TL morphological generator*.
>>>>>>>>>
>>>>>>>>> I propose that we do word-level sentiment analysis of the source
>>>>>>>>> language and the target language. For the time being I want to
>>>>>>>>> work on English-Hindi translation. We do not need
>>>>>>>>> neural-network-based translation; however, to get the sentiment
>>>>>>>>> associated with each word we might use NLTK, or develop a
>>>>>>>>> character-level embedding just to find the sentiment associated
>>>>>>>>> with each word, and form a dictionary out of it. I have written a
>>>>>>>>> paper on this and obtained good results. So, during the final
>>>>>>>>> application development we will just have the dictionary, with no
>>>>>>>>> neural-network dependencies. This can easily be done with Python.
>>>>>>>>> I just need a good corpus of English and Hindi words (the
>>>>>>>>> sentiment datasets are available online).
>>>>>>>>>
>>>>>>>>> The *sentiment-based TL morphological generator* will generate
>>>>>>>>> the list of possible words, and we will take the word whose
>>>>>>>>> sentiment is closest to that of the source-language word. This is
>>>>>>>>> a novel method that has probably not been applied before, and it
>>>>>>>>> might generate better results.
>>>>>>>>>
>>>>>>>>> Please provide your valuable feedback and suggest any necessary
>>>>>>>>> changes.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Rajarshi
>>>>
>>>> --
>>>> *Khanna, Tanmai*
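[Editor's note: the selection rule in the proposal above (generate the candidate target-language words, then take the one whose sentiment score is closest to the source word's) can be sketched in a few lines of Python. The dictionaries and scores below are invented for illustration; a real system would fill them from SentiWordNet and the TL morphological generator.]

```python
# Sketch of the proposed sentiment-based selection step.
# Sentiment scores are assumed to lie in [-1, 1]; all entries are
# hypothetical placeholders, not real dictionary data.

# Hypothetical sentiment scores for source-language (English) expressions.
src_sentiment = {"watch out": -0.6}

# Hypothetical TL candidates (romanised Hindi) with their sentiment scores,
# as the TL morphological generator might produce them.
tl_candidates = {
    "watch out": [("savdhan", -0.7), ("dhyan dena", -0.1)],
}

def pick_translation(src_word):
    """Choose the candidate whose sentiment is closest to the source word's."""
    target = src_sentiment[src_word]
    return min(tl_candidates[src_word], key=lambda cand: abs(cand[1] - target))[0]
```

Here "savdhan" would win (|-0.7 - (-0.6)| = 0.1 versus 0.5 for the milder candidate), which is exactly the "retain the urgency" behaviour argued for earlier in the thread; Tino's point still applies, though, since this step needs the source word's score to survive into target generation.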
Attachment: Gsoc_Project_Idea.docx (MS-Word 2007 document)
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff