Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - Implementing the supervised method of weighting automata

Amr Mohamed Hosny Anwar Sat, 18 May 2019 16:59:36 -0700

Dear maintainers, contributors,

Hope this email finds you well.


This email is a status report detailing next week's plan. I am also seeking 
feedback and suggestions regarding the project.
After a fruitful discussion with my mentors Nick, Flammie and Francis, we have 
agreed to implement the supervised method of weighting automata as follows:

The command will look like: lt-weight transducer.bin corpus.tagged

transducer.bin: An FST compiled using lttoolbox.
corpus.tagged: A tagged corpus that will be used to estimate the weights.

The weighting will be done by composing the main "unweighted" FST with a set of 
simple FSTs that are generated for each token.
A simplified example: if the main FST has an edge a:b::0 and the estimated 
weight for this edge is W, then the main FST will be composed with a simple FST 
consisting of an edge b:b::W, generating a new FST with an edge a:b::W.
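
To make the example concrete, here is a minimal Python sketch of composing two 
single-edge transducers. It is only an illustration and not how hfst-compose is 
implemented: the Edge type, the compose_edges helper and the value of W are 
hypothetical, and I am assuming tropical-semiring weights, where the weights 
along a path are added.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    """A single-edge transducer written as input:output::weight."""
    inp: str
    out: str
    weight: float


def compose_edges(left: Edge, right: Edge) -> Optional[Edge]:
    """Compose two single edges; return None when the labels do not match."""
    if left.out != right.inp:
        return None
    # Assumption: tropical semiring, so weights along a path are summed.
    return Edge(left.inp, right.out, left.weight + right.weight)


W = 1.5  # hypothetical estimated weight
main_edge = Edge("a", "b", 0.0)   # a:b::0 from the unweighted FST
weight_edge = Edge("b", "b", W)   # b:b::W, the simple weighting FST
composed = compose_edges(main_edge, weight_edge)
print(composed)
```

The composed edge keeps the input of the main FST (a) and the output of the 
simple FST (b), and carries the sum of the two weights, which is W.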

To achieve this, I will create a new shell script that makes use of HFST's 
compose (instead of implementing/adding a compose function to lttoolbox). 
We will approve and adopt this approach if the prototype proves to work as 
expected.

The shell script will work as follows:
1) lt-print will be used to convert the FST to AT&T format.
2) The weights will be estimated from the tagged corpus by counting the unigram 
lexical forms. (A clever set of shell commands can do the job, but I am not an 
expert in shell scripting, so it will take me some time. I am open to 
suggestions, sources and examples for doing so.)
3) For each weighted string, hfst-strings2fst (or the corresponding regex 
version) will be used to generate simple FSTs.
4) The FSTs will be composed using hfst-compose.
5) The final FST will be converted to AT&T format.
6) lt-comp will be used to regenerate a weighted FST that is compatible with 
all the tools that rely on Apertium.
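
As a starting point for step 2, here is a minimal Python sketch of the counting 
(a stand-in until the shell version exists). It assumes that every token in the 
tagged corpus is written as ^surface/analysis$ with exactly one analysis, and it 
ignores escaped characters; the sample text and the count_analyses helper are 
made up for illustration.

```python
import re
from collections import Counter

# Assumption: tokens look like ^surface/analysis$ with a single analysis.
# Tokens that are still ambiguous (more than one "/") are skipped.
TOKEN = re.compile(r"\^[^/^$]*/([^/^$]+)\$")


def count_analyses(text: str) -> Counter:
    """Count how often each lexical form (analysis) occurs in the text."""
    return Counter(TOKEN.findall(text))


# Hypothetical sample, not taken from a real corpus.
sample = (
    "^The/the<det><def><sp>$ ^dogs/dog<n><pl>$ ^chase/chase<vblex><pres>$ "
    "^the/the<det><def><sp>$ ^dogs/dog<n><pl>$"
)
counts = count_analyses(sample)
for analysis, n in counts.most_common():
    print(f"{n}\t{analysis}")
```

The output is one "count<TAB>analysis" line per distinct lexical form, which can 
then be turned into weighted strings for step 3.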

In this version, we will just use unigram counts of the lexical forms to 
estimate the weights.
Additionally, the weight will be assigned to the final state and won't be 
distributed among the edges (we will most probably want to change this later).
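
The following Python sketch shows one way the counts could become weights, and 
what "weight on the final state" looks like in AT&T format. The formula 
-log(count / total) is my assumption (a common choice for tropical weights) and 
has not been agreed on yet; the counts, weights_from_counts and att_acceptor are 
hypothetical.

```python
import math
import re
from collections import Counter

# Split an analysis into symbols: a whole <tag> or a single character.
SYMBOL = re.compile(r"<[^>]+>|.")


def weights_from_counts(counts: Counter) -> dict:
    """Assumption: weight = -log(relative frequency) of the lexical form."""
    total = sum(counts.values())
    return {a: -math.log(n / total) for a, n in counts.items()}


def att_acceptor(analysis: str, weight: float) -> str:
    """AT&T text for an identity FST whose only weight is on the final state."""
    symbols = SYMBOL.findall(analysis)
    # Transition lines: source, target, input, output, weight (all zero here).
    lines = [f"{i}\t{i + 1}\t{s}\t{s}\t0.000000" for i, s in enumerate(symbols)]
    # Final-state line: state number followed by its weight.
    lines.append(f"{len(symbols)}\t{weight:.6f}")
    return "\n".join(lines)


# Hypothetical counts, not measured on a real corpus.
counts = Counter({"dog<n><pl>": 3, "dog<vblex><pres>": 1})
weights = weights_from_counts(counts)
att_text = att_acceptor("dog<n><pl>", weights["dog<n><pl>"])
print(att_text)
```

More frequent lexical forms get smaller weights, and every edge keeps weight 0, 
so the whole cost of a path is paid at its final state.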

In parallel, I will try to improve the list of publications and ideas that 
will be used to weight automata in an unsupervised way.
I would be grateful if you could share resources and ideas regarding this 
part.

Finally, do you have recommendations for tagged corpora that can be used 
throughout the project for benchmarking?
I am currently using the English tagged corpus from the apertium-eng repository 
(https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged).
It would be better if we could benchmark on corpora and FSTs of different 
sizes and complexity.

Thanks and looking forward to hearing from you.
Your suggestions, feedback and feature requests are more than welcome.

Best Regards,
Amr

________________________________
From: Amr Mohamed Hosny Anwar
Sent: Sunday, May 19, 2019 12:50:52 AM
To: apertium-stuff
Cc: nlhow...@gmail.com
Subject: GSoC 19: Unsupervised weighting of automata - Implementing the 
supervised method of weighting automata


Best Regards,
Amr Keleg

