[Apertium-stuff] GSOC Idea - Corpus-based lexicalised feature transfer

Sambhav Jain Mon, 26 Mar 2012 18:08:11 -0700

Hi,

I have fleshed out the idea for "Corpus-based lexicalised feature
transfer". Feedback is welcome.


http://apertium.codepad.org/F9AlEyZS  (better formatted text)

 Apertium
 ========

  Corpus-based lexicalised feature transfer
  -----------------------------------------
  Problem Statement (as on wiki): Sometimes we get really inadequate
translations even though you'd never hear stuff like that. One of
those things is when we output something as definite when it is never
used as definite. One way of dealing with this is a lot of rules and
lists in transfer, but those are hard to do. So, how about looking at
a corpus for information about some features like definiteness,
aspect, evidentiality, impersonal/reflexive pronoun use in Romance
languages etc.

  DEFINITENESS
  ------------

  I propose a module, the "Definiteness Adapter", which will sit just
before morphological generation in the Apertium pipeline.


 1) What does the module do?
    ========================
    - Removes an explicit definiteness marker if the language does not
expect one.
    - Introduces an explicit definiteness marker if the language
expects one.

 2) Approach
    ========
    I am planning to use a hybrid module which primarily uses a machine
learning approach to decide whether to remove or insert explicit
'definite' markers.

    On top of it (if required) sits a rule-based layer, which can
include PROHIBITION rules or others.

 3) Architecture
    ============
    The task would require two modules.
    1) Module 1, which uses a trained model to make predictions and is
situated just before the morphological generator in the Apertium
pipeline. It will contain a rule-based sub-module that lets the user
add rules, based on error analysis, which are applied to the
machine-predicted output.
    2) Module 2, a stand-alone module which builds the model from a
raw language corpus.


                         Trained Lang. Model
                                  |
                          ________V________
    sequence+feature --->|    Module 1     |---> sequence'+feature
                         |_________________|


                    _________________                               _________________
    Raw Corpus --->|    Morph        |---> Morph output stream --->|    Module 2     |---> Trained Lang. Model
                   |_________________|                             |_________________|
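
    As a rough illustration of what Module 1 does with the model's
output, here is a minimal Python sketch. It assumes tokens are plain
strings rather than real Apertium stream units, that English "the" is
the only marker, and that `toy_predict` stands in for the trained
model; all of these names are hypothetical.

```python
from typing import Callable, List, Sequence

DEFINITE_MARKERS = {"the"}  # assumption: English only; other languages need their own set


def adapt_definiteness(
    tokens: Sequence[str],
    predict: Callable[[List[str]], List[int]],
    marker: str = "the",
) -> List[str]:
    """Strip existing markers, then re-insert one before every token labelled 1."""
    # Drop whatever definiteness markers the earlier stages produced.
    bare = [t for t in tokens if t.lower() not in DEFINITE_MARKERS]
    # One label per remaining token: 1 = marker expected before it, 0 = not.
    labels = predict(bare)
    if len(labels) != len(bare):
        raise ValueError("predict must return one label per token")
    out: List[str] = []
    for token, label in zip(bare, labels):
        if label == 1:
            out.append(marker)
        out.append(token)
    return out


def toy_predict(tokens: List[str]) -> List[int]:
    """Hypothetical stand-in for the trained model."""
    return [1 if t in {"beginning", "year"} else 0 for t in tokens]
```

    Calling `adapt_definiteness` on "at beginning of the year 2009"
with `toy_predict` both keeps the marker before "year" and adds the
missing one before "beginning".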


 4) Learn from Corpus
    =================
    The idea is to learn a model that can predict the presence or
absence of an explicit 'definite' marker. This clearly demands a large
training corpus.

    Since this task only requires predicting definiteness markers,
developing a training corpus is comparatively easy, because the labels
can be read off the text itself:
        1. Every language has a finite number of definiteness markers.
        2. Raw text is easy to get.


 5) Corpus Preparation
    ==================
    Procedure for preparing the training corpus:
    1. Choose a representative raw corpus for the language (Wikipedia
is a good option).
    2. Run the Apertium morphological analyser for that language on
the corpus.
    3. Select and populate the training features and the prediction
label, and arrange them in a format that can be fed to a classifier.

    E.g., for English the definiteness marker is "the", so the file
would look something like this (the marker tokens themselves are
dropped from the rows):

    Sentence: At the beginning of the year 2009
    Class label: 1 if a definiteness marker occurs before the token, 0 otherwise

    LEMMA        FEATURE1 (say POS)    FEATURE2 (...)    CLASS
    -----        ------------------    --------------    -----
    at           pr                    X                 0
    beginning    vb                    Y                 1
    of           pr                    Z                 0
    year         n                     P                 1
    2009         num                   N                 0
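
    The labelling step can be sketched in a few lines of Python. This
is only an illustration: it assumes the analyser output has already
been reduced to (lemma, POS) pairs, and the function names and the
tab-separated layout are my own placeholders.

```python
from typing import FrozenSet, Iterable, List, Tuple

Row = Tuple[str, str, int]  # (lower-cased lemma, POS, class label)


def make_training_rows(
    analysed: Iterable[Tuple[str, str]],
    markers: FrozenSet[str] = frozenset({"the"}),  # assumption: English
) -> List[Row]:
    """Drop marker tokens and label the token that follows each one with 1."""
    rows: List[Row] = []
    marker_pending = False
    for lemma, pos in analysed:
        if lemma.lower() in markers:
            # The marker itself is not a training row; remember that we saw it.
            marker_pending = True
            continue
        rows.append((lemma.lower(), pos, 1 if marker_pending else 0))
        marker_pending = False
    return rows


def format_rows(rows: Iterable[Row]) -> str:
    """One token per line, tab-separated, class label in the last column."""
    return "\n".join(f"{lemma}\t{pos}\t{label}" for lemma, pos, label in rows)
```

    Feeding it the analysed example sentence gives the five rows of
the table, with class 1 on "beginning" and "year".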


 6) Choice of Classifier
    ====================
    I plan to use a CRF (conditional random field) as the classifier
to predict the presence of the 'definite' marker. CRFs have an
established reputation for sequence labelling tasks. If the accuracy
turns out to be poor, other available options include SVM, Bayes etc.

 7) Choice of Features
    ==================
    1) LEMMA - lemma in lower case
               Lower-casing shrinks the vocabulary and hence reduces
sparsity; case information is kept as a separate feature (CASE) instead.
    2) POS   - POS tag of the token; use UNKNOWN if the POS tag is unknown.
    3) CASE  - alphabetic case of the lemma
               a. Initial Capital (IC)
                     e.g. England, Stanford
               b. All Upper (AU)
                     e.g. MIT, CPU
               c. All Lower (AL)
                     e.g. core, pen
               d. Number_Digits (Ni, where i is the number of digits)
                     e.g. 2009 (N4), 100 (N3)
               e. Alpha Numeric (AN)
                     e.g. F16, b12, 99ace
               f. Others (O)
                     e.g. . , - @ + '
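
    The CASE classes above could be computed along these lines. This
is a sketch only; the function name is a placeholder, and the handling
of tokens the list does not cover (mixed case such as "iPhone" goes to
O, a single capital letter goes to AU) is my assumption.

```python
def case_feature(token: str) -> str:
    """Map a surface token to one of the CASE classes: IC, AU, AL, Ni, AN, O."""
    if token.isdigit():
        # Ni: i is the number of digits, e.g. "2009" -> "N4".
        return f"N{len(token)}"
    if token.isalpha():
        if token.islower():
            return "AL"
        if token.isupper():
            return "AU"  # assumption: a single capital letter also counts as AU
        if token[0].isupper() and token[1:].islower():
            return "IC"
        return "O"  # assumption: other mixed-case tokens fall under Others
    if token.isalnum():
        # Only letters and digits, but neither purely one nor the other.
        return "AN"
    return "O"
```

    The feature has to be computed on the original token, before the
lemma is lower-cased, otherwise IC and AU can never occur.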

 8) Learning Template
    =================
    A sensible baseline is to start with a state-of-the-art learning
template for chunking, then tweak it to increase accuracy. This is
difficult to systematise and needs some trial and error. I have done
it earlier for other problems.
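
    To make "learning template" concrete, here is a small Python
sketch that generates a chunking-style feature template. It assumes
the CRF++ template syntax (`%x[row,col]` macros, `U` for unigram
features, `B` for output-label bigrams); no toolkit has been fixed
yet, so treat the toolkit, the column order (0 = LEMMA, 1 = POS,
2 = CASE) and the window size as assumptions.

```python
def build_template(n_columns: int = 3, window: int = 2) -> str:
    """Generate a CRF++-style template: context window plus adjacent pairs per column."""
    lines = []
    feature_id = 0
    for col in range(n_columns):
        # Single-token features at offsets -window .. +window.
        for offset in range(-window, window + 1):
            lines.append(f"U{feature_id:02d}:%x[{offset},{col}]")
            feature_id += 1
        # Pair features: previous/current and current/next token.
        for offset in (-1, 0):
            lines.append(
                f"U{feature_id:02d}:%x[{offset},{col}]/%x[{offset + 1},{col}]"
            )
            feature_id += 1
    # A bare "B" line adds features over adjacent output labels.
    lines.append("B")
    return "\n".join(lines)
```

    Tweaking then amounts to changing the window, dropping columns or
adding cross-column combinations, and re-measuring accuracy.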


 9) Post Editing - A rule based approach
    ====================================
    This sub-module will be a rule base that allows the system's
predictions to be overridden. Rules can be written by doing error
analysis and finding frequent cases where the model makes mistakes.
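
    One possible shape for this rule layer, sketched in Python. The
rule signature, the "first matching rule wins" policy and the example
rule are all hypothetical; the real rule format is still to be
designed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[str, str]  # (lemma, POS)
# A rule looks at the sequence and a position and returns 0 or 1 to
# override the prediction there, or None to leave it alone.
Rule = Callable[[Sequence[Row], int], Optional[int]]


def no_marker_before_numbers(rows: Sequence[Row], i: int) -> Optional[int]:
    """Hypothetical PROHIBITION rule: never put a marker before a numeral."""
    return 0 if rows[i][1] == "num" else None


def apply_rules(
    rows: Sequence[Row], predicted: Sequence[int], rules: Sequence[Rule]
) -> List[int]:
    """Return the predictions with rule verdicts applied on top."""
    final = list(predicted)
    for i in range(len(rows)):
        for rule in rules:
            verdict = rule(rows, i)
            if verdict is not None:
                final[i] = verdict
                break  # first matching rule wins
    return final
```

    With this layout, each frequent error found during error analysis
becomes one small function appended to the rule list.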


10) Comments
    ========
    1. The above description is given for definiteness, but the module
can easily be ported to other features such as aspect.
    2. I feel language-specific linguistic features could be added,
depending on the language.

--
GenX

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
