Re: [Apertium-stuff] GSOC 2020 idea

Rajarshi Roychoudhury Thu, 27 Feb 2020 11:23:24 -0800

To answer the question of how to analyse sentiment in a low-resource
language, I think character embeddings would be the best option. The set of
words in a corpus is never exhaustive, but the number of unique characters
is small and well defined. We can learn an embedding weight for each
character and apply it to a number of NLP techniques, not just sentiment
analysis. That can slightly reduce the disadvantage of working with a
low-resource language.
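
To make the idea concrete, here is a minimal sketch in plain Python. It only
shows the mechanics: the corpus is a toy example, the vectors are random
placeholders standing in for trained weights, and the function names
(build_char_vocab, init_embeddings, embed_word) are made up for illustration.

```python
import random


def build_char_vocab(corpus):
    """Map every distinct character in the corpus to an integer index."""
    chars = sorted({ch for word in corpus for ch in word})
    return {ch: i for i, ch in enumerate(chars)}


def init_embeddings(vocab, dim=4, seed=0):
    """One vector per character.

    Assumption: random values stand in for weights that would normally be
    learned from a labelled sentiment dataset.
    """
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in vocab}


def embed_word(word, embeddings, dim=4):
    """Represent a word as the average of its known character vectors."""
    vectors = [embeddings[ch] for ch in word if ch in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy corpus; a real one would be in the language being analysed.
corpus = ["good", "bad", "great"]
vocab = build_char_vocab(corpus)
embeddings = init_embeddings(vocab)

# "bored" never occurs in the corpus, but all of its characters do,
# so it still gets a vector.
unseen = embed_word("bored", embeddings)
```

Python strings iterate over Unicode code points, so the same code should work
on Devanagari or Bengali text without changes.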


On Fri, Feb 28, 2020, 00:46 Rajarshi Roychoudhury <rroychoudhu...@gmail.com>
wrote:

> As I mentioned earlier, I would like to work on English-Hindi or
> English-Bengali translation. The dataset can be obtained from SentiWordNet
> for Indian Languages,
> https://amitavadas.com/sentiwordnet.php
> which is by far the most comprehensive dataset available for sentiment
> analysis. It contains data for both Hindi and Bengali.
>
> I cannot give any example specific to Apertium because whenever I try to
> translate a word from English in the interface, the available target
> languages are ones I do not know. I am not sure if I am right, but
> Hindi/Bengali is probably not one of the languages into which an English
> word can be translated. Correct me if I am wrong.
>
>
>
> On Fri, Feb 28, 2020, 00:31 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>
>> Hi, I have a few questions about this:
>> 1. How would you analyse the sentiment of the source text, considering
>> that the language pairs Apertium deals with are low-resource languages?
>> 2. As Tino mentions, is there a problem of sentiment loss in Apertium?
>> Any examples of this?
>> 3. Doesn't the sentiment analysis of a language require a decent amount
>> of training data? Where would this data be found for low resource languages?
>>
>> Tanmai
>>
>> On Fri, Feb 28, 2020 at 12:02 AM Rajarshi Roychoudhury <
>> rroychoudhu...@gmail.com> wrote:
>>
>>> The effect won't be very evident on simple sentences. I think it would
>>> be more effective on sentences where the choice of words can decide the
>>> effectiveness of the translation. It's not about whether "Watch out"
>>> could be "Be careful"; it's about choosing words that retain the urgency
>>> in "Watch out". Sentiment information about the original sentence can
>>> help with that.
>>>
>>> On Thu, Feb 27, 2020, 23:47 Scoop Gracie <scoopgra...@gmail.com> wrote:
>>>
>>>> So, "Watch out!" could become "Be careful"?
>>>>
>>>> On Thu, Feb 27, 2020, 10:13 Rajarshi Roychoudhury <
>>>> rroychoudhu...@gmail.com> wrote:
>>>>
>>>>> It is not just about minimizing loss of sentiment; it is about using
>>>>> that information for better translation. A trivial example: in some
>>>>> situations a sentence projects a strong sentiment, and a simple
>>>>> translation may not always yield the best result. However, if we can
>>>>> use knowledge of the sentiment to choose the words, it might give a
>>>>> better result.
>>>>>
>>>>> As far as the code is concerned, I need to study the source code or
>>>>> detailed documentation before proposing a feasible solution.
>>>>>
>>>>> Best,
>>>>> Rajarshi
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 27, 2020, 23:21 Tino Didriksen <m...@tinodidriksen.com>
>>>>> wrote:
>>>>>
>>>>>> My first question would be, is this actually a problem for rule-based
>>>>>> machine translation? I am not a linguist, but given how RBMT works I
>>>>>> can't really see where sentiment would be lost in the process,
>>>>>> especially because Apertium is designed for related languages where
>>>>>> sentiment is mostly the same. But even for less related languages, it
>>>>>> would be down to the quality of the source language analysis.
>>>>>>
>>>>>> Beyond that, please learn how Apertium specifically works, not just
>>>>>> RBMT in general. http://wiki.apertium.org/wiki/Documentation is a
>>>>>> good start, but our IRC channel is the best place to ask technical
>>>>>> questions.
>>>>>>
>>>>>> One major issue specific to Apertium is that the source information
>>>>>> is no longer available in the target generation step.
>>>>>>
>>>>>> E.g., since you mention English-Hindi, you could install
>>>>>> apertium-eng-hin and see how each part of the pipe works. We have
>>>>>> precompiled binaries for common platforms. Again, see wiki and IRC.
>>>>>>
>>>>>> -- Tino Didriksen
>>>>>>
>>>>>>
>>>>>> On Thu, 27 Feb 2020 at 08:16, Rajarshi Roychoudhury <
>>>>>> rroychoudhu...@gmail.com> wrote:
>>>>>>
>>>>>>> Formally, I present my idea in this form.
>>>>>>>
>>>>>>> From my understanding, an RBMT system contains:
>>>>>>>
>>>>>>>    - a *SL morphological analyser* - analyses a source language
>>>>>>>    word and provides the morphological information;
>>>>>>>    - a *SL parser* - is a syntax analyser which analyses source
>>>>>>>    language sentences;
>>>>>>>    - a *translator* - used to translate a source language word into
>>>>>>>    the target language;
>>>>>>>    - a *TL morphological generator* - works as a generator of
>>>>>>>    appropriate target language words for the given grammatical
>>>>>>>    information;
>>>>>>>    - a *TL parser* - works as a composer of suitable target
>>>>>>>    language sentences
>>>>>>>
>>>>>>> I propose a 6th component of the RBMT system: a *sentiment-based TL
>>>>>>> morphological generator*.
>>>>>>>
>>>>>>> I propose that we do word-level sentiment analysis of the source
>>>>>>> language and the target language. For the time being I want to work on
>>>>>>> English-Hindi translation. We do not need neural-network-based
>>>>>>> translation; however, to get the sentiment associated with each word
>>>>>>> we might use NLTK, or develop a character-level embedding just to find
>>>>>>> the sentiment associated with each word, and form a dictionary out of
>>>>>>> it. I have written a paper on this and received good results. So
>>>>>>> basically, in the final application we will just have the dictionary,
>>>>>>> with no neural network dependencies. This can easily be done with
>>>>>>> Python. I just need a good corpus of English and Hindi words (the
>>>>>>> sentiment datasets are available online).
>>>>>>>
>>>>>>> The *sentiment-based TL morphological generator* will generate the
>>>>>>> list of possible words, and we will take the word whose sentiment is
>>>>>>> closest to that of the source language word.
>>>>>>> This is a novel method that has probably not been applied before,
>>>>>>> and might generate better results.
>>>>>>>
>>>>>>> Please provide your valuable feedback and suggest any necessary
>>>>>>> changes that need to be made.
>>>>>>> Best,
>>>>>>> Rajarshi
>>>>>>>
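
A rough sketch of how the selection step described above might look once the
two sentiment dictionaries exist. Everything here is an assumption for
illustration: the function name pick_closest, the romanized Hindi candidates,
and the scores (invented values in [-1, 1], not taken from SentiWordNet or
any real lexicon).

```python
def pick_closest(source_word, candidates, src_scores, tgt_scores, neutral=0.0):
    """Return the candidate whose sentiment score is nearest to the source word's.

    Words missing from either dictionary are treated as neutral.
    On a tie, the first candidate in the list wins.
    """
    target = src_scores.get(source_word, neutral)
    return min(candidates, key=lambda w: abs(tgt_scores.get(w, neutral) - target))


# Hypothetical word-level sentiment dictionaries.
src_scores = {"terrible": -0.9, "bad": -0.5}
tgt_scores = {"kharab": -0.5, "bhayanak": -0.9, "theek": 0.1}

# Candidate target words, as the generator might offer them.
candidates = ["kharab", "bhayanak", "theek"]

strong = pick_closest("terrible", candidates, src_scores, tgt_scores)
mild = pick_closest("bad", candidates, src_scores, tgt_scores)
```

At run time this needs only the two dictionaries, so there are no neural
network dependencies in the final application.
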
>>>>>> _______________________________________________
>>>>>> Apertium-stuff mailing list
>>>>>> Apertium-stuff@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> *Khanna, Tanmai*
>>
>

