Re: [Apertium-stuff] GSoC 2019 project discussion - Unsupervised weighting of automata

Tommi A Pirinen Fri, 05 Apr 2019 07:06:52 -0700

Hi Amr, 

a solid proposal and coding challenge, some comments inline:

On Thu, Mar 28, 2019 at 09:08:25PM +0000, Amr Mohamed Hosny Anwar wrote:
> Dear all,
> 
> Kindly find a draft of my proposal for the "Unsupervised weighting of 
> automata".
> http://wiki.apertium.org/wiki/User:AMR-KELEG

A few points on the schedule:

* a week here and there for research is ok, but we want to be able to
  track progress, so experimenting and documenting would be a part of
  those weeks
* for the final part, it is important to allocate enough time for
  integrating the project into the apertium system; ideally a
  successful project ends with a tool that all apertium language
  developers can integrate into their languages without significant
  effort

> I believe that I will need to target a set of published papers to implement 
> throughout the project.
> However, I am having trouble finding useful set of publications for the task.
> I'd be grateful if you could help me by recommending some publications or 
> even keywords to look for.
> I am currently exploring papers related to spectral learning but I don't know 
> whether this topic is related to the task or not.

This is an important point, and I agree the workflow should be to
follow some reference implementations and documentation. I saw the
spectral FST one but I have not tried it, so I have no idea of its
complexity or suitability yet. I hope someone with more experience in
unsupervised methods can comment as well. I think one thing to start
from as a baseline is a model that just counts things in the ambiguous
results: symbols, tags and lemmas. I think there's also some
implementation, and maybe a paper, about counting the arc visitations or
state visitations in the analysis traversals.
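
To make the counting baseline concrete, here is a minimal sketch in
Python. It is only an illustration under assumptions: the input format
(a list of analysis strings per token, with tags in angle brackets), the
function names and the toy data are all made up for the example, and the
fractional 1/n counting is just one possible choice.

```python
import math
import re
from collections import Counter

# Assumption: tags appear in angle brackets, e.g. "walk<n><sg>".
TAG_RE = re.compile(r"<([^<>]+)>")


def count_tags(ambiguous_tokens):
    """Count tags over all analyses of each token.

    Each of a token's n analyses contributes 1/n to the tags it
    contains, so every token adds the same total mass per reading slot.
    """
    counts = Counter()
    for analyses in ambiguous_tokens:
        if not analyses:
            continue
        share = 1.0 / len(analyses)
        for analysis in analyses:
            for tag in TAG_RE.findall(analysis):
                counts[tag] += share
    return counts


def to_weights(counts):
    """Turn counts into negative log probabilities (lower = more likely)."""
    total = sum(counts.values())
    return {tag: -math.log(c / total) for tag, c in counts.items()}


# Toy data, invented for illustration: one ambiguous and one
# unambiguous token.
corpus = [
    ["walk<n><sg>", "walk<vblex><inf>"],
    ["dog<n><sg>"],
]
counts = count_tags(corpus)
weights = to_weights(counts)
```

With this toy input the noun tag ends up with a lower weight than the
verb tag, simply because it occurs in more readings. The same counting
could be done over lemmas or whole analyses instead of tags.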

A few more thoughts:

* what happens to unseen stuff? It needs to be very unlikely but still
  possible in the final re-weighted model
* to that point, most languages have infinite vocabularies with
  compounding and such, e.g. you can write manbearpig and it might not
  be in the corpus but we think it's more likely than
  zirconiumkumqvattaxi; this is not necessary for the project but can be
  kept in sight
* I think we should measure some baselines from other methods, e.g.
  apertium's current statistical analysers, and keep track of progress
  against those throughout the summer
* I don't have any good pointers for the background, maybe check through
  what other fst folk have done:

  http://www.opengrm.org/twiki/bin/view/GRM/WebHome
  https://aclweb.org/aclwiki/SIGFSM
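
As a rough illustration of the unseen-items and compounding points
above, here is a small Python sketch using add-one (Laplace) smoothing.
This is a swapped-in textbook technique, not something the project is
committed to; the counts, the vocabulary size, the function names and
the hand-made compound segmentations are all invented for the example.

```python
import math


def smoothed_weight(counts, item, vocab_size, alpha=1.0):
    """Negative log probability with additive smoothing.

    Unseen items get a small but non-zero probability, so their weight
    is high but finite.
    """
    total = sum(counts.values())
    prob = (counts.get(item, 0) + alpha) / (total + alpha * vocab_size)
    return -math.log(prob)


def compound_weight(counts, parts, vocab_size):
    """Weight of a compound as the sum of the weights of its parts."""
    return sum(smoothed_weight(counts, p, vocab_size) for p in parts)


# Toy counts, invented for illustration only.
toy_counts = {"man": 50, "bear": 20, "pig": 10, "taxi": 5}
vocab_size = 6  # assumption: the four seen items plus two unseen ones

likely = compound_weight(toy_counts, ["man", "bear", "pig"], vocab_size)
unlikely = compound_weight(
    toy_counts, ["zirconium", "kumqvat", "taxi"], vocab_size
)
```

Here the compound built from seen parts gets a lower total weight than
the one built mostly from unseen parts, while both remain possible.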

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
Entwickler.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
