First Call For Participation

Determining Sentiment Intensity of English and Arabic Phrases
SemEval 2016 - Task 7

http://alt.qcri.org/semeval2016/task7/

The objective of the task is to test an automatic system’s ability to
predict a sentiment intensity (also known as evaluativeness or sentiment
association) score for a word or a phrase. Phrases include negators,
modals, intensifiers, and diminishers -- categories known to be
challenging for sentiment analysis. Specifically, the participants will be
given a list of terms (single words and multi-word phrases) and be asked to
provide a score between 0 and 1 that is indicative of the term’s strength
of association with positive sentiment. A score of 1 indicates maximum
association with positive sentiment (or least association with negative
sentiment) and a score of 0 indicates least association with positive
sentiment (or maximum association with negative sentiment). If a term is
more positive than another, then it should have a higher score than the
other.

We introduced this task as part of the SemEval-2015 Task 10 Sentiment
Analysis in Twitter, Subtask E (Rosenthal et al., 2015), where the target
terms were taken from Twitter. In SemEval-2016, we broaden the scope of
the task to include three different domains: general English, English
Twitter, and Arabic Twitter. The Twitter domain differs significantly from
the general English domain; it includes hashtags, which are often a
composition of several words (e.g., #feelingood), misspellings,
shortenings, slang, etc.


SUBTASKS

We will have three subtasks, one for each of the three domains:

-- General English Sentiment Modifiers Set: This test set has phrases
formed by combining a word and a modifier, where a modifier is a negator,
an auxiliary verb, a degree adverb, or a combination of those. For example,
'would be very easy', 'did not harm', and 'would have been nice'. (See
development data for more examples.) The test set also includes single word
terms (as separate entries). These terms are chosen from the set of words
that are part of the multi-word phrases.  For example, 'easy', 'harm', and
'nice'. The terms in the test set will have the same form as the terms in
the development set, but can involve different words and modifiers.

-- English Twitter Mixed Polarity Set: This test set focuses on phrases
made up of opposite polarity terms. For example, phrases such as 'lazy
sundays', 'best winter break', 'happy accident', and 'couldn't stop
smiling'. Observe that 'lazy' is associated with negative sentiment whereas
'sundays' is associated with positive sentiment. Automatic systems have to
determine the degree of association of the whole phrase with positive
sentiment. The test set also includes single word terms (as separate
entries). These terms are chosen from the set of words that are part of the
multi-word phrases. For example, terms such as 'lazy', 'sundays', 'best',
'winter', and so on. This allows the evaluation to determine how good the
automatic systems are at determining sentiment association of individual
words as well as how good they are at determining sentiment of phrases
formed by their combinations. The multi-word phrases and single-word terms
are drawn from a corpus of tweets, and may include a small number of
hashtag words and creatively spelled words. However, a majority of the
terms are those that one would use in everyday English.

-- Arabic Twitter Set: This test set includes single words and phrases
commonly found in Arabic tweets. The phrases in this set are formed only by
combining a negator and a word. See development data for examples.

In each subtask the target terms are chosen from the corresponding domain.
We will provide a development set and a test set for each domain. No
separate training data will be provided. The development sets will be large
enough to be used for tuning or even for training. The test sets and the
development sets will have no terms in common. The participants are free to
use any additional manually or automatically generated resources; however,
we will require that all resources be clearly identified in the submission
files and in the system description paper.

All of these terms are manually annotated to obtain their strength of
association scores. We use CrowdFlower to crowdsource the annotations. We
use the MaxDiff method of annotation. Kiritchenko et al. (2014) showed that
even though annotators might disagree about answers to individual
questions, the aggregated scores produced with MaxDiff and the
corresponding term ranking are consistent. We verified this by randomly
selecting ten groups of five answers to each question and comparing the
scores and rankings obtained from these groups of annotations. On average,
the scores of the terms from the data we have previously annotated
(SemEval-2015 Subtask E Twitter data and SemEval-2016 general English
terms) differed only by 0.02-0.04 per term, and the Spearman rank
correlation coefficient between two sets of rankings was 0.97-0.98.
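The aggregation step above can be sketched in code. The following is a
minimal illustration of the standard MaxDiff counting procedure (each
term's raw score is the fraction of times it was chosen as most positive
minus the fraction of times it was chosen as least positive, rescaled to
[0, 1]); the function name and the toy annotations are invented for the
example and are not part of the task's official tooling.

```python
from collections import defaultdict

def maxdiff_scores(answers):
    """Aggregate MaxDiff annotations into real-valued scores in [0, 1].

    `answers` is a list of (terms_shown, best_term, worst_term) triples,
    one per annotation question.  A term's raw score is
    (#times best - #times worst) / #times shown; raw scores are then
    rescaled linearly onto the interval [0, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for terms, b, w in answers:
        for t in terms:
            shown[t] += 1
        best[b] += 1
        worst[w] += 1
    raw = {t: (best[t] - worst[t]) / shown[t] for t in shown}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0  # guard against all-equal raw scores
    return {t: (v - lo) / span for t, v in raw.items()}

# Toy example: three annotators each pick the most and least positive
# term out of the same four-term tuple.
answers = [
    (("great", "okay", "awful", "nice"), "great", "awful"),
    (("great", "okay", "awful", "nice"), "nice", "awful"),
    (("great", "okay", "awful", "nice"), "great", "okay"),
]
scores = maxdiff_scores(answers)
```

Even with few annotations per question, the aggregated scores induce a
stable ranking (here, 'great' > 'nice' > 'okay' > 'awful'), which is the
consistency property reported above.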


EVALUATION

The participants can submit results for any one, two, or all three
subtasks. We will provide separate test files for each subtask. The test
file will contain a list of terms from the corresponding domain. The
participating systems are expected to assign a sentiment intensity score to
each term. The order of the terms in the submissions can be arbitrary.

System ratings for terms in each subtask will be evaluated by first ranking
the terms according to sentiment score and then comparing this ranked list
to a ranked list obtained from human annotations. Kendall's Tau (Kendall,
1938) will be used as the metric to compare the ranked lists. We will
provide scores for Spearman's Rank Correlation as well, but participating
teams will be ranked by Kendall's Tau.

We have released an evaluation script so that participants can:
-- make sure their output is in the right format;
-- track the progress of their system's performance on the development data.
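For intuition, the rank comparison can be sketched as follows. This is an
illustrative implementation of Kendall's tau-a (tied pairs count toward
neither the concordant nor the discordant total), not the official
evaluation script, and the term scores in the example are invented.

```python
from itertools import combinations

def kendall_tau(system_scores, gold_scores):
    """Kendall's tau-a between two scorings of the same set of terms.

    Counts term pairs ordered the same way by both scorings (concordant)
    versus the opposite way (discordant), normalized by the total number
    of pairs n*(n-1)/2.
    """
    terms = list(gold_scores)
    concordant = discordant = 0
    for a, b in combinations(terms, 2):
        s = system_scores[a] - system_scores[b]
        g = gold_scores[a] - gold_scores[b]
        if s * g > 0:
            concordant += 1
        elif s * g < 0:
            discordant += 1
    n = len(terms)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical gold and system intensity scores in [0, 1].
gold = {"awful": 0.10, "okay": 0.50, "nice": 0.80, "great": 0.95}
system = {"awful": 0.20, "okay": 0.60, "nice": 0.55, "great": 0.90}
tau = kendall_tau(system, gold)
```

Here the system inverts one of the six pairs ('okay' vs. 'nice'), giving
tau = (5 - 1) / 6; a perfect ranking would score 1.0 regardless of the
absolute score values, since only the induced ordering matters.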


IMPORTANT DATES

-- Training data ready: September 4, 2015
-- Test data ready: Dec 15, 2015
-- Evaluation start: January 10, 2016
-- Evaluation end: January 31, 2016
-- Paper submission due: February 28, 2016
-- Paper reviews due: March 31, 2016
-- Camera ready due: April 30, 2016
-- SemEval workshop: Summer 2016


BACKGROUND AND MOTIVATION

Many of the top performing sentiment analysis systems in recent SemEval
competitions (2013 Task 2, 2014 Task 4, and 2014 Task 9) rely on
automatically generated sentiment lexicons. Sentiment lexicons are lists of
words (and phrases) with prior associations to positive and negative
sentiments. Some lexicons can additionally provide a sentiment score for a
term to indicate its strength of evaluative intensity. Higher scores
indicate greater intensity. Existing manually created sentiment lexicons
tend to only have discrete labels for terms (positive, negative, neutral)
but no real-valued scores indicating the intensity of sentiment. Here, for
the first time, we manually create a dataset of words and phrases with
real-valued scores of intensity. The goal of this task is to evaluate
automatic methods for determining sentiment scores of words and phrases.
Many of the phrases in the test set will include negators (such as 'no' and
'doesn’t'), modals (such as 'could' and 'may be'), and intensifiers and
diminishers (such as 'very' and 'slightly'). This task will enable
researchers to examine methods for estimating how each of these word
categories impacts the intensity of sentiment.


ORGANIZERS

-- Svetlana Kiritchenko, National Research Council Canada
-- Saif M. Mohammad, National Research Council Canada
-- Mohammad Salameh, University of Alberta

-- 
Saif Mohammad
Research Officer
Information and Communications Technologies Portfolio
National Research Council Canada
http://www.saifmohammad.com