Hi, I have written my idea in the attached file. It is just the idea, not the project proposal. Kindly read it and give feedback on whether this could be a feasible GSoC project.

Best,
Rajarshi
On Fri, 28 Feb 2020 at 06:31, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:

> Here are some published papers on how character embeddings are used for
> classification:
>
> https://arxiv.org/abs/1810.03595
> https://lsm.media.mit.edu/papers/tweet2vec_vvr.pdf
> https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
>
> We have just finished writing a paper on this and have obtained better
> results than the ones in the papers mentioned above. The dataset is
> collected from SentiWordNet, as I mentioned earlier.
> I am not on IRC yet; I will join it.
>
> Best,
> Rajarshi
>
> On Fri, Feb 28, 2020, 01:01 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>
>> How exactly can characters predict sentiment? Don't you still need some
>> training data for pairs? English, Hindi, and Bangla aren't really
>> low-resource languages.
>>
>> Anyway, we can continue this discussion on IRC so that it will be easier
>> and more people can contribute to the discussion.
>>
>> Tanmai
>>
>> On 28-Feb-2020, at 00:52, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>
>> To answer the question of how to analyse sentiment in a low-resource
>> language: I think character embeddings would be the best option. The set
>> of words in a corpus is not exhaustive, but the number of unique
>> characters is small and fixed. We can learn an embedding weight for each
>> character and apply it to a number of NLP tasks, not just sentiment
>> analysis. The drawback of a low-resource language can be slightly
>> mitigated this way.
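[Editor's note: to make the character-embedding idea above concrete, here is a minimal sketch. It is not the method from the paper Rajarshi mentions; the toy word list, the embedding dimension, and the simplification of keeping the character embeddings fixed (rather than training them jointly, as a character-level model normally would) are all assumptions made purely for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled words: 1 = positive, 0 = negative (invented for illustration).
train = [("good", 1), ("great", 1), ("nice", 1),
         ("bad", 0), ("awful", 0), ("poor", 0)]

# The character inventory is small and fixed, unlike the vocabulary --
# this is the point made in the quoted message above.
chars = sorted({c for word, _ in train for c in word})
char_to_ix = {c: i for i, c in enumerate(chars)}

dim = 8
emb = rng.normal(size=(len(chars), dim))  # one embedding vector per character
w = np.zeros(dim)                         # classifier weights
b = 0.0                                   # classifier bias

def featurize(word):
    """Represent a word as the mean of its characters' embeddings."""
    ixs = [char_to_ix[c] for c in word if c in char_to_ix]
    return emb[ixs].mean(axis=0) if ixs else np.zeros(dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic regression on top of the (fixed) character embeddings
# with plain gradient descent. A real system would train the embeddings
# jointly, e.g. with a character-level CNN as in the cited papers.
X = np.array([featurize(word) for word, _ in train])
y = np.array([label for _, label in train], dtype=float)
for _ in range(5000):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

def sentiment(word):
    """Probability that `word` carries positive sentiment."""
    return float(sigmoid(featurize(word) @ w + b))
```

Because the features are per-character, the same trained table can score words that never appeared in training, which is the claimed advantage for low-resource settings; how well that generalises is exactly what a real evaluation would have to show.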
>> On Fri, Feb 28, 2020, 00:46 Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>
>>> As I mentioned earlier, I would like to work on English-Hindi or
>>> English-Bengali translation. The dataset can be obtained from the
>>> SentiWordNet for Indian languages:
>>> https://amitavadas.com/sentiwordnet.php
>>> which is by far the most resourceful dataset available for sentiment
>>> analysis. It contains data for both Hindi and Bengali.
>>>
>>> I cannot give an example specific to Apertium, because whenever I try
>>> to translate a word from English in the interface, the available target
>>> languages are beyond my knowledge. I am not sure, but Hindi/Bengali is
>>> probably not among the languages into which an English word can be
>>> translated. Correct me if I am wrong.
>>>
>>> On Fri, Feb 28, 2020, 00:31 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>>>
>>>> Hi, I have a few questions about this:
>>>> 1. How would you analyse the sentiment of the source text, considering
>>>> that the language pairs Apertium deals with are low-resource languages?
>>>> 2. As Tino mentions, is there actually a problem of sentiment loss in
>>>> Apertium? Any examples of this?
>>>> 3. Doesn't sentiment analysis of a language require a decent amount of
>>>> training data? Where would this data be found for low-resource
>>>> languages?
>>>>
>>>> Tanmai
>>>>
>>>> On Fri, Feb 28, 2020 at 12:02 AM Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>
>>>>> The effect won't be very evident in simple sentences; I think it
>>>>> would be more noticeable in sentences where the choice of words
>>>>> decides the quality of the translation. It is not about whether
>>>>> "Watch out" could become "Be careful"; it is about choosing words
>>>>> that retain the urgency of "watch out". Sentiment information about
>>>>> the original sentence can help with that.
>>>>> On Thu, Feb 27, 2020, 23:47 Scoop Gracie <scoopgra...@gmail.com> wrote:
>>>>>
>>>>>> So, "Watch out!" could become "Be careful"?
>>>>>>
>>>>>> On Thu, Feb 27, 2020, 10:13 Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>>>
>>>>>>> It is not just about minimizing the loss of sentiment; it is about
>>>>>>> using that information for better translation. A trivial example:
>>>>>>> in some situations a sentence projects a strong sentiment, and a
>>>>>>> plain translation may not always yield the best result. However, if
>>>>>>> we can use knowledge of the sentiment to choose the words, it might
>>>>>>> give a better result.
>>>>>>>
>>>>>>> As far as the code is concerned, I need to study the source code,
>>>>>>> or detailed documentation, before proposing a feasible solution.
>>>>>>>
>>>>>>> Best,
>>>>>>> Rajarshi
>>>>>>>
>>>>>>> On Thu, Feb 27, 2020, 23:21 Tino Didriksen <m...@tinodidriksen.com> wrote:
>>>>>>>
>>>>>>>> My first question would be: is this actually a problem for
>>>>>>>> rule-based machine translation? I am not a linguist, but given how
>>>>>>>> RBMT works I can't really see where sentiment would be lost in the
>>>>>>>> process, especially because Apertium is designed for related
>>>>>>>> languages, where sentiment is mostly the same. But even for less
>>>>>>>> related languages, it would come down to the quality of the
>>>>>>>> source-language analysis.
>>>>>>>>
>>>>>>>> Beyond that, please learn how Apertium specifically works, not
>>>>>>>> just RBMT in general. http://wiki.apertium.org/wiki/Documentation
>>>>>>>> is a good start, but our IRC channel is the best place to ask
>>>>>>>> technical questions.
>>>>>>>>
>>>>>>>> One major issue specific to Apertium is that the source
>>>>>>>> information is no longer available in the target generation step.
>>>>>>>> E.g., since you mention English-Hindi, you could install
>>>>>>>> apertium-eng-hin and see how each part of the pipe works. We have
>>>>>>>> precompiled binaries for common platforms. Again, see the wiki and
>>>>>>>> IRC.
>>>>>>>>
>>>>>>>> -- Tino Didriksen
>>>>>>>>
>>>>>>>> On Thu, 27 Feb 2020 at 08:16, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Formally, I present my idea in this form. From my understanding,
>>>>>>>>> an RBMT system contains:
>>>>>>>>>
>>>>>>>>> - a *SL morphological analyser* - analyses a source-language word
>>>>>>>>> and provides its morphological information;
>>>>>>>>> - a *SL parser* - a syntax analyser which analyses
>>>>>>>>> source-language sentences;
>>>>>>>>> - a *translator* - translates a source-language word into the
>>>>>>>>> target language;
>>>>>>>>> - a *TL morphological generator* - generates appropriate
>>>>>>>>> target-language words for the given grammatical information;
>>>>>>>>> - a *TL parser* - composes suitable target-language sentences.
>>>>>>>>>
>>>>>>>>> I propose a sixth component of the RBMT system: a
>>>>>>>>> *sentiment-based TL morphological generator*.
>>>>>>>>>
>>>>>>>>> I propose that we do word-level sentiment analysis of the source
>>>>>>>>> language and the target language. For the time being I want to
>>>>>>>>> work on English-Hindi translation. We do not need
>>>>>>>>> neural-network-based translation; however, to get the sentiment
>>>>>>>>> associated with each word we might use NLTK, or develop a
>>>>>>>>> character-level embedding just to find the sentiment associated
>>>>>>>>> with each word, and form a dictionary out of it. I have written a
>>>>>>>>> paper on this and obtained good results. So, during the final
>>>>>>>>> application development we will just have the dictionary, with no
>>>>>>>>> neural-network dependencies. This can easily be done with Python.
>>>>>>>>> I just need a good corpus of English and Hindi words (the
>>>>>>>>> sentiment datasets are available online).
>>>>>>>>>
>>>>>>>>> The *sentiment-based TL morphological generator* will generate
>>>>>>>>> the list of possible words, and we will take the word whose
>>>>>>>>> sentiment is closest to that of the source-language word. This is
>>>>>>>>> a novel method that has probably not been applied before, and it
>>>>>>>>> might generate better results.
>>>>>>>>>
>>>>>>>>> Please provide your valuable feedback and suggest any necessary
>>>>>>>>> changes.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Rajarshi
>>>>
>>>> --
>>>> *Khanna, Tanmai*
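[Editor's note: the selection rule in the proposal above (generate the candidate target-language words, then take the one whose sentiment score is closest to the source word's) can be sketched in a few lines of Python. The dictionaries and scores below are invented for illustration; a real system would fill them from SentiWordNet and the TL morphological generator.]

```python
# Sketch of the proposed sentiment-based selection step.
# Sentiment scores are assumed to lie in [-1, 1]; all entries are
# hypothetical placeholders, not real dictionary data.

# Hypothetical sentiment scores for source-language (English) expressions.
src_sentiment = {"watch out": -0.6}

# Hypothetical TL candidates (romanised Hindi) with their sentiment scores,
# as the TL morphological generator might produce them.
tl_candidates = {
    "watch out": [("savdhan", -0.7), ("dhyan dena", -0.1)],
}

def pick_translation(src_word):
    """Choose the candidate whose sentiment is closest to the source word's."""
    target = src_sentiment[src_word]
    return min(tl_candidates[src_word], key=lambda cand: abs(cand[1] - target))[0]
```

Here "savdhan" would win (|-0.7 - (-0.6)| = 0.1 versus 0.5 for the milder candidate), which is exactly the "retain the urgency" behaviour argued for earlier in the thread; Tino's point still applies, though, since this step needs the source word's score to survive into target generation.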
Attachment: Gsoc_Project_Idea.docx (MS-Word 2007 document)
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff