On 2014-06-14 10:31, Andrei Sfrent wrote:
> Hi,
>
> My name is Andrei Sfrent, I am studying for a Master's Degree in
> Machine Learning at Imperial College London and I am looking for
> projects that I could implement for my MSc.
>
> I am already in contact with Apertium through GSoC and I thought the
> project would be a good opportunity to apply Machine Learning / NLP to
> real world problems. I would really appreciate some pointers or
> project ideas that would combine research and experimentation in a
> suitable way for a 2.5 months MSc thesis.
Here are some suggestions that I can come up with... I'm not sure how
many of them could be done in 2.5 months (does that include the write-up
time too?)
1) New part-of-speech tagger with target-language tagger training:
Choose a machine learning tool (MaxEnt, CRF, SVM, ...)
Write a part-of-speech tagger (can be just a prototype in python or
something)
Implement target-language tagger training, like Felipe did as part
of his thesis.
http://www.springerlink.com/content/m452802q3536044v/fulltext.pdf
(contact Felipe if you don't have access)
It would be good to be able to decompose analyses into linguistic
features and then feed those to the tagger. For example, instead of
trying to estimate parameters for each full tag string at once, as in
^örneğin/örnek<n><gen>
/örneğin<cnjadv>
/örnek<n><px2sg><nom>
/örnek<n><px2sg><nom>+i<cop><aor><p3><sg>$
you could define a "case" feature as (nom|gen), a "cop" feature as
(cop|ø), a "possession" feature as (px2sg|px1sg|px3sp), etc., and then
combine prob(n|context) * prob(possession|context) * prob(cop|context)
and so on. I don't know the specifics of how you might do this, but
Mikel probably will :)
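To make the decomposition idea concrete, here is a toy python sketch.
The feature groups and the probability table are made up purely for
illustration; a real tagger would estimate prob(feature|context) from a
corpus with whatever ML tool you pick:

```python
import re
from functools import reduce

# Hypothetical feature groups: each maps a feature name to its possible tags.
FEATURE_GROUPS = {
    "case": {"nom", "gen"},
    "cop": {"cop"},
    "possession": {"px1sg", "px2sg", "px3sp"},
}

def decompose(analysis):
    """Map an analysis like 'örnek<n><px2sg><nom>' to feature values."""
    tags = re.findall(r"<([^>]+)>", analysis)
    features = {}
    for name, values in FEATURE_GROUPS.items():
        hit = next((t for t in tags if t in values), None)
        features[name] = hit if hit is not None else "ø"  # feature absent
    return features

def score(analysis, prob):
    """Combine independent per-feature probabilities:
    prob(case|...) * prob(cop|...) * prob(possession|...)."""
    feats = decompose(analysis)
    return reduce(lambda acc, kv: acc * prob(kv[0], kv[1]), feats.items(), 1.0)

# Toy conditional probabilities standing in for a trained model.
def toy_prob(feature, value):
    table = {
        ("case", "nom"): 0.7, ("case", "gen"): 0.3, ("case", "ø"): 0.5,
        ("cop", "cop"): 0.2, ("cop", "ø"): 0.8,
        ("possession", "px2sg"): 0.4, ("possession", "ø"): 0.6,
    }
    return table.get((feature, value), 0.5)

# Pick the most probable analysis among two of the readings above.
best = max(
    ["örnek<n><gen>", "örnek<n><px2sg><nom>"],
    key=lambda a: score(a, toy_prob),
)
```

The point is just that each analysis contributes a handful of feature
values rather than one monolithic tag string, so the model shares
statistics across analyses that agree on a feature.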
2) Corpus-based lexicalised-feature transfer:
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Corpus-based_lexicalised_feature_transfer
See the "see also". Again, you could use your favourite machine
learning tool here.
-- One possibility for Romanian would be deciding when to use "pe"
with a direct object.
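The "pe" case can be framed as binary classification over features of
the direct object. A toy python sketch (the annotations, features and
"corpus" are invented; in practice you would extract them from a tagged
corpus and train with a proper ML tool):

```python
def features(obj):
    """Features of the direct object (obj is a dict of toy annotations)."""
    return {
        "is_pronoun": obj.get("pos") == "prn",
        "is_proper": obj.get("pos") == "np",
        "is_definite": obj.get("def", False),
    }

# Tiny hand-labelled "corpus": (object annotations, takes "pe"?)
CORPUS = [
    ({"pos": "prn"}, True),               # "pe el"
    ({"pos": "np"}, True),                # "pe Maria"
    ({"pos": "n", "def": False}, False),  # "o carte"
    ({"pos": "n", "def": False}, False),
]

def train(corpus):
    """Per-feature vote counts: a crude stand-in for parameter estimation."""
    weights = {}
    for obj, label in corpus:
        for name, value in features(obj).items():
            if value:
                weights[name] = weights.get(name, 0) + (1 if label else -1)
    return weights

def predict(weights, obj):
    """Insert "pe" when the active features vote positively overall."""
    score = sum(weights.get(n, 0) for n, v in features(obj).items() if v)
    return score > 0

W = train(CORPUS)
```

Swap the vote counting for MaxEnt/CRF/SVM training and the shape of the
task stays the same.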
3) Feature pruning for the lexical selection module.
http://rua.ua.es/dspace/bitstream/10045/35848/1/thesis_FrancisMTyers.pdf
You could implement a few ideas on how to prune features which have
been learnt by the unsupervised training of the lexical selection
module.
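One very simple pruning idea, sketched in python: drop features whose
learnt weights are near zero, since they barely affect rule scores. The
weight format and threshold here are placeholders, not the module's
actual representation:

```python
def prune_features(weights, threshold=0.05):
    """Keep only features whose |weight| exceeds the threshold."""
    return {f: w for f, w in weights.items() if abs(w) > threshold}

# Toy learnt weights for context features of a lexical-selection rule.
learnt = {
    "context:casa": 1.3,
    "context:mare": 0.01,   # negligible: pruned
    "context:rio": -0.8,
    "context:de": -0.02,    # negligible: pruned
}
pruned = prune_features(learnt)
```

More interesting variants would prune by effect on held-out translation
quality rather than raw weight magnitude.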
4) MERT for the lexical selection module.
Implement minimum error rate training to tune the weights of the
rules learnt by the lexical selection module.
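The outer loop of MERT is easy to sketch: for each weight, try candidate
values and keep whichever minimises error on a development set. The
rules, dev set and error function below are toy placeholders, not the
real lexical-selection format:

```python
def error(weights, dev_set, apply_rules):
    """Fraction of dev items where the weighted rules pick the wrong word."""
    wrong = sum(1 for src, ref in dev_set if apply_rules(weights, src) != ref)
    return wrong / len(dev_set)

def mert(weights, dev_set, apply_rules,
         grid=(-1.0, -0.5, 0.0, 0.5, 1.0), rounds=3):
    """Coordinate-wise search over a fixed grid of weight values."""
    weights = dict(weights)
    for _ in range(rounds):
        for rule in weights:
            best = min(grid, key=lambda v: error({**weights, rule: v},
                                                 dev_set, apply_rules))
            weights[rule] = best
    return weights

# Toy setup: two rules voting between two translations of "bank".
def apply_rules(weights, src):
    score_a = weights["rule_a"] if src == "bank" else 0.0
    score_b = weights["rule_b"]
    return "riverbank" if score_a > score_b else "bank(fin)"

DEV = [("bank", "riverbank"), ("bank", "riverbank"), ("money", "bank(fin)")]
tuned = mert({"rule_a": 0.0, "rule_b": 0.0}, DEV, apply_rules)
```

Real MERT does an exact line search per weight instead of a fixed grid,
but the tune-against-dev-set structure is the same.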
If you have any further questions, I would be happy to elaborate.
F.
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff