[Mt-list] MLP 2017 Call for Participation in Shared Tasks on Cross-lingual Word Segmentation and Morpheme Segmentation

Mikel L. Forcada Thu, 18 May 2017 10:24:07 -0700

MLP 2017 Call for Participation in Shared Tasks on Cross-lingual Word Segmentation and Morpheme Segmentation

The analysis of word formation is among the most fundamental natural language processing (NLP) technologies for extracting basic processing units for further NLP tasks in many languages. There are broadly two groups of segmentation tasks related to word formation: morpheme segmentation and word segmentation. Morpheme segmentation is required in languages such as Turkish, where words are formed from stems, root words, prefixes, and/or suffixes. It is the foundation for further morphological analysis tasks. Word segmentation is necessary in languages such as Mandarin Chinese, where the writing system does not mark word boundaries.

Although there is clear similarity among different languages in terms of either morpheme segmentation or word segmentation, most segmentation tools are designed specifically for one language. In this shared task, we encourage participants to submit the results of one system/method as applied to multiple languages for one of the two segmentation tasks. These systems are expected to demonstrate the ability of cross-lingual processing on the segmentation tasks, which would give our community insights into the building of fundamental NLP tools for low-resource languages.

Popular languages such as Chinese and Japanese are also included in the task for two reasons. Firstly, although morpheme segmentation and word segmentation tools for these languages have been developed for many years and are often regarded as mature technologies, human creativity and the variability of textual genres and dialects exhibited in language evolution still make them challenging problems for these languages. Secondly, we would like to encourage participants of this shared task to develop systems/methods that can be used across different languages where morpheme segmentation or word segmentation is required for natural language processing.

A corpus of at least 2,000 sentences will be prepared as the training set in each language for either morpheme segmentation or word segmentation. Development and test sets will each include 1,000 sentences for system development and evaluation purposes. The whole corpus will comprise multiple genres where plausible in both subtasks. Recommendations of additional language resources will also be listed/provided for some languages by the organizers. These resources might include, but will not be limited to, dictionaries, articles, social media posts and bilingual (aligned) texts for the target languages.

Each task will be organized into two subtasks, constrained and semi-constrained, according to the annotated data that may be used. In the constrained subtasks, participants will use only the corpora provided by the shared task in the development of systems, which makes it easier to compare different technologies and their pros and cons. In the semi-constrained subtasks, participants are encouraged to use additional publicly available resources to further improve the performance of their systems. It should be noted that for the external data used in semi-constrained subtasks, only un-annotated (raw) data can be used, while annotated data with word or morpheme boundaries cannot. The four subtasks are as follows; participants can take part in any (or all) of the subtasks.

 * Task: Word Segmentation (WS)
     o Subtask: Word Segmentation - Constrained (WSC)
     o Subtask: Word Segmentation - Semi-constrained (WSS)
 * Task: Morpheme Segmentation (MS)
     o Subtask: Morpheme Segmentation - Constrained (MSC)
     o Subtask: Morpheme Segmentation - Semi-constrained (MSS)

During development, results of systems tuned only with the given development sets must be submitted. Participants may also submit additional results tuned with different development sets, provided a description of how these sets were produced is given, e.g. a subset derived manually from the original given development set or by using some other method. The organizers will provide results of baseline systems for the constrained morpheme segmentation (MSC) and constrained word segmentation (WSC) subtasks. The results of submitted systems will be evaluated against the prepared test set for each language. Precision, recall and F1 measure will be used as metrics for the evaluation.
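The call does not say whether scores are computed over boundaries or over whole segments, so the following is only a minimal sketch of one common convention: a predicted segment counts as correct when its character span matches a gold span exactly, and counts are micro-averaged over sentences. The function names `spans` and `prf` are hypothetical and are not part of any official scorer.

```python
from typing import List, Set, Tuple


def spans(segments: List[str]) -> Set[Tuple[int, int]]:
    """Map a list of segments to the set of (start, end) character offsets."""
    result = set()
    start = 0
    for seg in segments:
        end = start + len(seg)
        result.add((start, end))
        start = end
    return result


def prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact segment spans.

    Assumption: span-level matching; the official evaluation may differ.
    """
    correct = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = spans(g), spans(p)
        correct += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold segmentation versus a hypothetical system output that over-splits
# one word into two segments.
gold = [["美國", "喬治亞", "州", "首府", "亞特蘭大"]]
pred = [["美國", "喬治", "亞", "州", "首府", "亞特蘭大"]]
precision, recall, f1 = prf(gold, pred)
```

In this example four of the six predicted segments match the five gold segments, so precision is 4/6, recall is 4/5 and F1 is 8/11.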


TARGET LANGUAGES (listed in alphabetical order)

 * Word Segmentation: Mandarin Chinese, Thai, Vietnamese.
 * Morpheme Segmentation: Basque, Farsi, Finnish, Japanese, Kazakh,
   Marathi, Uyghur.

DATA SAMPLE


The format of the data is shown below.

 * Uyghur; morpheme segmentation

ئسلاھ‪//‬ئات ئاچ‪//‬ئې‪//‬ۋەت‪//‬ىش‪//‬نى چوڭ‪//‬قۇرئىلگىرى سۈر//دۇق


 * Basque, i.e. Euskara; morpheme segmentation

      Paper\\a\\k mahai\\a\\ren gain\\ean daude

 * Mandarin Chinese; word segmentation

      美國 喬治亞 州 首府 亞特蘭大
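The samples above suggest that words are separated by whitespace and that morpheme boundaries inside a word are marked by a delimiter (`//` in the Uyghur sample, `\\` in the Basque sample). A minimal reading sketch under that assumption follows; the function name `parse_line` and its `morpheme_sep` parameter are hypothetical, and the delimiters in the released data may differ from those shown in this message.

```python
from typing import List


def parse_line(line: str, morpheme_sep: str = "") -> List[List[str]]:
    """Split a line into words on whitespace, then each word into morphemes.

    With an empty morpheme_sep (word segmentation data), every word is
    returned as a single-element list.
    """
    words = line.split()
    if not morpheme_sep:
        return [[w] for w in words]
    return [w.split(morpheme_sep) for w in words]


# Basque sample: the delimiter is assumed to be two literal backslashes,
# exactly as printed in the sample line.
basque = parse_line(r"Paper\\a\\k mahai\\a\\ren gain\\ean daude", morpheme_sep=r"\\")

# Mandarin Chinese sample: whitespace alone separates words.
chinese = parse_line("美國 喬治亞 州 首府 亞特蘭大")
```

Keeping the delimiter as a parameter avoids hard-coding a choice that the samples do not make consistently across languages.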

SCHEDULE

   May 20, 2017              Shared Task Website Ready
   May 20, 2017              First Call for Participants Ready
   May 20, 2017              Registration Begins
   June 20, 2017             Release of Training Set
   July 5, 2017              Dry run: Release of Development Set
   July 8, 2017              Dry run: Results Submission on Development Set
   July 10, 2017             Dry run: Release of Scores
   July 12, 2017             Release of Surprise Languages (Training and Development Sets)
   July 20, 2017             Registration Ends
   July 24, 2017             Release of Test Set
   July 31, 2017             Submission of Systems
   August 4, 2017            System Results
   August 11, 2017           System Description Paper Due
   August 18, 2017           Notification of Acceptance
   August 25, 2017           Camera-Ready Deadline



Registration:

Please send a registration email to [email protected] with the following information:


 * Institution:
     o Name
     o Country
 * Contact person:
     o Title
     o Last Name
     o First Name
     o Email address
 * Tasks and Subtasks to participate in.

The title of the registration email should be: Registration.

ORGANIZERS (listed in alphabetical order)

   Alberto Poncelas      ADAPT Centre, Dublin City University
   Alex Huynh            University of Science, Vietnam National University Ho Chi Minh City
   Chao-Hong Liu         ADAPT Centre, Dublin City University
   Dinh Dien             University of Science, Vietnam National University Ho Chi Minh City
   Francis Tyers         UiT Norgga árktalaš universitehta
   Majid Latifi          Universitat Politècnica de Catalunya
   Nasun-Urt             Inner Mongolia University
   Prachya Boonkwan      National Electronics and Computer Technology Center
   Teresa Lynn           ADAPT Centre, Dublin City University
   Thepchai Supnithi     National Electronics and Computer Technology Center
   Tommi A Pirinen       Universität Hamburg
   Qun Liu               ADAPT Centre, Dublin City University
   Vinit Ravishankar     Maharashtra Institute of Technology
   Yating Yang           University of Chinese Academy of Sciences

--
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776

