On 2014-06-14 10:31, Andrei Sfrent wrote:
> Hi,
>
> My name is Andrei Sfrent, I am studying for a Master's Degree in
> Machine Learning at Imperial College London and I am looking for
> projects that I could implement for my MSc.
>
> I am already in contact with Apertium through GSoC and I thought the
> project would be a good opportunity to apply Machine Learning / NLP to
> real world problems. I would really appreciate some pointers or
> project ideas that would combine research and experimentation in a
> suitable way for a 2.5 months MSc thesis.
Here are some suggestions that I can come up with... I'm not sure how
many of them could be done in 2.5 months (does that include the write-up
time too?)
1) New part-of-speech tagger with target-language tagger training:
Choose a machine learning tool (MaxEnt, CRF, SVM, ...)
Write a part-of-speech tagger (can be just a prototype in python or
something)
Implement target-language tagger training, like Felipe did as part
of his thesis.
http://www.springerlink.com/content/m452802q3536044v/fulltext.pdf
(contact Felipe if you don't have access)
It would be good to be able to decompose analyses into linguistic
features and then feed those to the tagger. For example, instead of
trying to estimate parameters for each full tag string at once, as in
^örneğin/örnek<n><gen>
/örneğin<cnjadv>
/örnek<n><px2sg><nom>
/örnek<n><px2sg><nom>+i<cop><aor><p3><sg>$
you could define a "case" feature as (nom|gen), a "cop" feature as
(cop|ø), a "possession" feature as (px2sg|px1sg|px3sp), etc., and then
combine prob(n|context) * prob(possession|context) * prob(cop|context)
and so on. I don't know the specifics of how you might do this, but
Mikel probably will :)
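To make the decomposition idea concrete, here is a toy python sketch.
The feature groups and the probability table are made up purely for
illustration; a real tagger would estimate prob(feature|context) from a
corpus with whatever ML tool you pick:

```python
import re
from functools import reduce

# Hypothetical feature groups: each maps a feature name to its possible tags.
FEATURE_GROUPS = {
    "case": {"nom", "gen"},
    "cop": {"cop"},
    "possession": {"px1sg", "px2sg", "px3sp"},
}

def decompose(analysis):
    """Map an analysis like 'örnek<n><px2sg><nom>' to feature values."""
    tags = re.findall(r"<([^>]+)>", analysis)
    features = {}
    for name, values in FEATURE_GROUPS.items():
        hit = next((t for t in tags if t in values), None)
        features[name] = hit if hit is not None else "ø"  # feature absent
    return features

def score(analysis, prob):
    """Combine independent per-feature probabilities:
    prob(case|...) * prob(cop|...) * prob(possession|...)."""
    feats = decompose(analysis)
    return reduce(lambda acc, kv: acc * prob(kv[0], kv[1]), feats.items(), 1.0)

# Toy conditional probabilities standing in for a trained model.
def toy_prob(feature, value):
    table = {
        ("case", "nom"): 0.7, ("case", "gen"): 0.3, ("case", "ø"): 0.5,
        ("cop", "cop"): 0.2, ("cop", "ø"): 0.8,
        ("possession", "px2sg"): 0.4, ("possession", "ø"): 0.6,
    }
    return table.get((feature, value), 0.5)

# Pick the most probable analysis among two of the readings above.
best = max(
    ["örnek<n><gen>", "örnek<n><px2sg><nom>"],
    key=lambda a: score(a, toy_prob),
)
```

The point is just that each analysis contributes a handful of feature
values rather than one monolithic tag string, so the model shares
statistics across analyses that agree on a feature.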
2) Corpus-based lexicalised-feature transfer:
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Corpus-based_lexicalised_feature_transfer
See the "see also". Again, you could use your favourite machine
learning tool here.
-- One possibility for Romanian would be deciding when to use "pe"
with a direct object.
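The "pe" case can be framed as binary classification over features of
the direct object. A toy python sketch (the annotations, features and
"corpus" are invented; in practice you would extract them from a tagged
corpus and train with a proper ML tool):

```python
def features(obj):
    """Features of the direct object (obj is a dict of toy annotations)."""
    return {
        "is_pronoun": obj.get("pos") == "prn",
        "is_proper": obj.get("pos") == "np",
        "is_definite": obj.get("def", False),
    }

# Tiny hand-labelled "corpus": (object annotations, takes "pe"?)
CORPUS = [
    ({"pos": "prn"}, True),               # "pe el"
    ({"pos": "np"}, True),                # "pe Maria"
    ({"pos": "n", "def": False}, False),  # "o carte"
    ({"pos": "n", "def": False}, False),
]

def train(corpus):
    """Per-feature vote counts: a crude stand-in for parameter estimation."""
    weights = {}
    for obj, label in corpus:
        for name, value in features(obj).items():
            if value:
                weights[name] = weights.get(name, 0) + (1 if label else -1)
    return weights

def predict(weights, obj):
    """Insert "pe" when the active features vote positively overall."""
    score = sum(weights.get(n, 0) for n, v in features(obj).items() if v)
    return score > 0

W = train(CORPUS)
```

Swap the vote counting for MaxEnt/CRF/SVM training and the shape of the
task stays the same.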
3) Feature pruning for the lexical selection module.
http://rua.ua.es/dspace/bitstream/10045/35848/1/thesis_FrancisMTyers.pdf
You could implement a few ideas on how to prune features which have
been learnt by the unsupervised training of the lexical selection
module.
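One very simple pruning idea, sketched in python: drop features whose
learnt weights are near zero, since they barely affect rule scores. The
weight format and threshold here are placeholders, not the module's
actual representation:

```python
def prune_features(weights, threshold=0.05):
    """Keep only features whose |weight| exceeds the threshold."""
    return {f: w for f, w in weights.items() if abs(w) > threshold}

# Toy learnt weights for context features of a lexical-selection rule.
learnt = {
    "context:casa": 1.3,
    "context:mare": 0.01,   # negligible: pruned
    "context:rio": -0.8,
    "context:de": -0.02,    # negligible: pruned
}
pruned = prune_features(learnt)
```

More interesting variants would prune by effect on held-out translation
quality rather than raw weight magnitude.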
4) MERT for the lexical selection module.
Implement minimum error rate training to tune the weights of the
rules learnt by the lexical selection module.
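The outer loop of MERT is easy to sketch: for each weight, try candidate
values and keep whichever minimises error on a development set. The
rules, dev set and error function below are toy placeholders, not the
real lexical-selection format:

```python
def error(weights, dev_set, apply_rules):
    """Fraction of dev items where the weighted rules pick the wrong word."""
    wrong = sum(1 for src, ref in dev_set if apply_rules(weights, src) != ref)
    return wrong / len(dev_set)

def mert(weights, dev_set, apply_rules,
         grid=(-1.0, -0.5, 0.0, 0.5, 1.0), rounds=3):
    """Coordinate-wise search over a fixed grid of weight values."""
    weights = dict(weights)
    for _ in range(rounds):
        for rule in weights:
            best = min(grid, key=lambda v: error({**weights, rule: v},
                                                 dev_set, apply_rules))
            weights[rule] = best
    return weights

# Toy setup: two rules voting between two translations of "bank".
def apply_rules(weights, src):
    score_a = weights["rule_a"] if src == "bank" else 0.0
    score_b = weights["rule_b"]
    return "riverbank" if score_a > score_b else "bank(fin)"

DEV = [("bank", "riverbank"), ("bank", "riverbank"), ("money", "bank(fin)")]
tuned = mert({"rule_a": 0.0, "rule_b": 0.0}, DEV, apply_rules)
```

Real MERT does an exact line search per weight instead of a fixed grid,
but the tune-against-dev-set structure is the same.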
If you have any further questions, I would be happy to elaborate.
F.
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff