Hi Amr,

a solid proposal and coding challenge. Some comments inline:

On Thu, Mar 28, 2019 at 09:08:25PM +0000, Amr Mohamed Hosny Anwar wrote:
> Dear all,
>
> Kindly find a draft of my proposal for the "Unsupervised weighting of
> automata".
> http://wiki.apertium.org/wiki/User:AMR-KELEG

A few points on the schedule:

* A week here and there for research is OK, but we want to be able to
  track progress, so experimenting and documenting should be part of
  those weeks.
* For the final part, it is important to allocate enough time for
  integrating the project into the Apertium system. Ideally, a
  successful project ends with a tool that all Apertium language
  developers can integrate into their languages without significant
  effort.

> I believe that I will need to target a set of published papers to implement
> throughout the project.
> However, I am having trouble finding useful set of publications for the task.
> I'd be grateful if you could help me by recommending some publications or
> even keywords to look for.
> I am currently exploring papers related to spectral learning but I don't know
> whether this topic is related to the task or not.

This is an important point, and I agree that the workflow should be to
follow some reference implementations and documentation. I saw the
spectral FST one, but I have not tried it, so I have no idea of its
complexity or suitability yet. I hope someone with more experience in
unsupervised methods can comment as well.

I think one thing that could serve as a starting baseline is a model
that just counts things from ambiguous results: symbols, tags and
lemmas. I think there's also some implementation, and maybe a paper,
about counting arc or state visitations in the analysis traversals.

A few more thoughts:

* What happens to unseen items? They need to be very unlikely but still
  possible in the final re-weighted model.
* Related to that, most languages have infinite vocabularies because of
  compounding and the like; e.g. you can write manbearpig, and it might
  not be in the corpus, but we think it's more likely than
  zirconiumkumqvattaxi. This is not necessary for the project but can be
  kept in sight.
* I think we should measure some baselines from other methods, e.g.
  Apertium's current statistical analysers, and keep track of progress
  against those throughout the summer.
* I don't have any good pointers for the background; maybe check what
  other FST folk have done:
  http://www.opengrm.org/twiki/bin/view/GRM/WebHome
  https://aclweb.org/aclwiki/SIGFSM

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>,
Universität Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>.
CLARIN-D Entwickler.
President of ACL SIGUR SIG for Uralic languages <http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.
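
P.S. To make the counting baseline a bit more concrete, here is a rough
sketch in Python. It is only an illustration under my own assumptions,
not a reference implementation and not existing Apertium code: the
function name estimate_weights, the input format (one list of candidate
tags per ambiguous token), the even split of each token's count across
its analyses, and the add-alpha smoothing are all hypothetical choices.
Weights are negative log probabilities, so a lower weight means a more
likely item, and unseen items stay possible but expensive.

```python
import math
from collections import Counter


def estimate_weights(ambiguous_analyses, vocabulary, alpha=1.0):
    """Estimate a weight (-log probability) for each item in the vocabulary.

    ambiguous_analyses: one list of candidate items (e.g. tags) per token;
        this input format is an assumption made for illustration.
    vocabulary: every item that should get a weight, seen or unseen.
    alpha: add-alpha smoothing constant, so unseen items keep a small
        non-zero probability (hypothetical choice of smoothing).
    """
    counts = Counter()
    for analyses in ambiguous_analyses:
        # No disambiguation is available, so each token contributes one
        # unit of count, split evenly across its candidate analyses.
        share = 1.0 / len(analyses)
        for item in analyses:
            counts[item] += share

    # Make sure every counted item is part of the smoothed distribution.
    items = set(vocabulary) | set(counts)
    total = sum(counts.values()) + alpha * len(items)
    return {item: -math.log((counts[item] + alpha) / total) for item in items}


# Toy input: three tokens, two of them ambiguous; "adv" is never seen.
corpus = [["n", "vblex"], ["n"], ["adj", "n"]]
weights = estimate_weights(corpus, vocabulary={"n", "vblex", "adj", "adv"})

for tag, weight in sorted(weights.items(), key=lambda pair: pair[1]):
    print(f"{tag}\t{weight:.3f}")
```

The unseen tag ends up with the highest weight but a finite one, which
is the behaviour asked for above; how to push such weights onto the arcs
of an actual transducer is a separate question that this sketch does not
address.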
_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff