Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - Implementing the supervised method of weighing automata

Amr Mohamed Hosny Anwar Fri, 31 May 2019 16:18:22 -0700

Hi all,

I would like to share the status of the script that implements the supervised 
method of weighting automata.


First, I used this page to understand the Xerox regex syntax, "Syntax of 
Regular Expressions (Finite-State Calculus)": 
ftp://ftp.cis.upenn.edu/pub/cis639/public_html/docs/fssyntax.html

Then, I wrote a bash script (URL: 
https://github.com/AMR-KELEG/lttoolbox/tree/supervised-weighing-of-automata-experiment/scripts)
 whose inputs are:

- A tagged corpus (such as: 
https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged)
- A compiled FST in apertium's format

Currently, the script isn't working because lt-comp fails to compile FSTs with 
initial epsilon transitions, or multiple FSTs in the same .att file.
I haven't inspected the source of the error yet.

The thrown exception is:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
  Aborted (core dumped)
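
The cause has not been inspected yet, so the following is only a hedged 
diagnostic sketch. It assumes (unconfirmed) that the stoi failure is triggered 
by a line whose state field is not an integer, for example the "--" separator 
that HFST writes between several transducers in one AT&T file. It does not 
address the initial-epsilon case. The helper name suspicious_att_lines is 
hypothetical and not part of lttoolbox or HFST.

```python
def suspicious_att_lines(att_text):
    """Return (line_number, line) pairs whose state fields are not integers."""
    problems = []
    for number, line in enumerate(att_text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        # Transition lines: source, target, input, output[, weight].
        # Final-state lines: state[, weight].
        state_fields = fields[:2] if len(fields) >= 4 else fields[:1]
        if not all(field.strip().isdigit() for field in state_fields):
            problems.append((number, line))
    return problems


# Two tiny transducers in one file, separated by "--" (assumed layout).
sample = "0\t1\ta\tb\t0.0\n1\t0.0\n--\n0\t1\tc\td\n1\n"
print(suspicious_att_lines(sample))
```

Running it on the output of the HFST step before calling lt-comp would show 
whether such lines are present at all.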

I tried adding [?*] at the start and the end of the regex, but the script 
became much slower: I left it running for about two hours and it still had not 
finished.
I will have to estimate the complexity of the compose function to predict 
whether the script's running time is reasonable.

To keep all the paths that aren't part of the tagged corpus, I added a 
regex of the form [?*]::1000000.
As a result, every path of the input FST is kept with a very large weight 
(we should represent INF in a better way).
However, each string matched by one of the weighted regexes now gets two 
analyses: the weighted one and the catch-all one with weight 1000000.
I will experiment with subtracting FSTs so that weighted analyses don't also 
receive the trivial fallback analysis.
I have tested the script using the apertium-eng FST and I am attaching the 
results of using the unweighted and weighted HFST FSTs.
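
To make the duplicate-analysis problem concrete, here is a minimal sketch that 
models the FSTs as plain Python collections of analyses rather than real 
automata. The function names and the example analyses are hypothetical; only 
the constant 1000000 comes from the regex above.

```python
FALLBACK_WEIGHT = 1000000.0  # stand-in for "infinity", as in [?*]::1000000


def union_with_fallback(all_analyses, weighted):
    """Union of the weighted analyses and the catch-all: duplicates appear."""
    result = [(analysis, weight) for analysis, weight in weighted.items()]
    result += [(analysis, FALLBACK_WEIGHT) for analysis in all_analyses]
    return sorted(result)


def subtract_then_union(all_analyses, weighted):
    """Remove the weighted analyses from the catch-all before the union."""
    remaining = set(all_analyses) - set(weighted)
    result = [(analysis, weight) for analysis, weight in weighted.items()]
    result += [(analysis, FALLBACK_WEIGHT) for analysis in remaining]
    return sorted(result)


all_analyses = ["cat<n><sg>", "cat<vblex><inf>"]
weighted = {"cat<n><sg>": 1.2}  # made-up weight for illustration

print(union_with_fallback(all_analyses, weighted))   # cat<n><sg> listed twice
print(subtract_then_union(all_analyses, weighted))   # each analysis once
```

With real automata the subtraction would be an FST difference, which is the 
experiment described above.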

Should I investigate the causes of the lttoolbox errors and maybe implement 
methods that will help in compiling HFST-generated .att files?
I believe we will need a way to compile weighted FSTs regardless of how the 
weights were generated.

Regards,
Amr


________________________________
From: Hèctor Alòs i Font <hectora...@gmail.com>
Sent: Wednesday, May 22, 2019 5:55 AM
To: [apertium-stuff]
Subject: Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - 
Implementing the supervised method of weighing automata

Hi Amr,

The files should be there. It was a mistake; I have added them now. Each line 
contains one word. If something is not clear, the "raw" files have the original 
texts without morphological analysis.

As for other manually tagged corpora, I cannot help. This is the only one we 
made in the projects I've been working on. Unfortunately, we didn't find the 
time to create a similar one for Sardinian two years ago.

Best,
Hèctor

Message from Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> on Tue, 
21 May 2019 at 19:05:

Hi Hector,

Yes, these files are exactly what I need.
However, it seems that these files (*.tagged.txt) aren't part of the upstream 
repository: https://github.com/apertium/apertium-oci/tree/master/texts

I am currently experimenting with the English and Italian tagged corpora and 
morphological analysers.
The more languages we have, the better we can compare weighting 
methodologies.
I don't have a strong background in linguistics, so I thought it would be 
better if you could recommend corpora from a diverse set of languages.

Thanks,
Amr

________________________________
From: Hèctor Alòs i Font <hectora...@gmail.com>
Sent: Tuesday, May 21, 2019 12:42:02 PM
To: [apertium-stuff]
Subject: Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - 
Implementing the supervised method of weighing automata

Hi Amr,

I'm not sure whether it will help, but apertium-oci/texts contains several 
manually disambiguated Occitan texts (approx. 14,000 words). They are:
atom_gascon.tagged.txt
continent.tagged.txt
glacier.tagged.txt
cors_aran.tagged.txt
hlama_coming.tagged.txt
uranus_prov.tagged.txt

Best,
Hector

Message from Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> on Sun, 
19 May 2019 at 2:59:

Dear maintainers, contributors,

Hope this email finds you well.

This mail is a status report: it details next week's plan and seeks feedback 
and suggestions on the project.
After a fruitful discussion with my mentors Nick, Flammie and Francis, we 
agreed to implement the supervised way of weighting automata as follows:

The command will look like: lt-weight transducer.bin corpus.tagged

transducer.bin: An FST compiled using lttoolbox.
corpus.tagged: A tagged corpus that will be used to estimate the weights.

The weighting will be done by composing the main "unweighted" FST with a set of 
simple FSTs that are generated for each token.
A simplified example: if the main FST has an edge a:b::0 and the estimated 
weight for this edge is W, then the main FST will be composed with a simple FST 
consisting of an edge b:b::W, generating a new FST with an edge a:b::W.
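
The simplified example can be sketched in a few lines of Python. This is an 
illustration only, not the HFST implementation: transducers are plain lists of 
arcs, epsilons and final states are ignored, and weights are added as in the 
tropical semiring. The compose helper and the value chosen for W are 
hypothetical.

```python
def compose(first, second):
    """Compose two epsilon-free weighted transducers given as arc lists.

    Each arc is (source, target, input, output, weight). Result states are
    pairs of states, and the weights of matched arcs are added.
    """
    arcs = []
    for (s1, t1, i1, o1, w1) in first:
        for (s2, t2, i2, o2, w2) in second:
            if o1 == i2:  # output of the first must match input of the second
                arcs.append(((s1, s2), (t1, t2), i1, o2, w1 + w2))
    return arcs


W = 2.5  # hypothetical estimated weight
main_fst = [(0, 1, "a", "b", 0.0)]   # the edge a:b::0
weight_fst = [(0, 1, "b", "b", W)]   # the edge b:b::W

print(compose(main_fst, weight_fst))
# [((0, 0), (1, 1), 'a', 'b', 2.5)]
```

The single resulting arc is a:b::W, matching the example above.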

To achieve this, I will create a new shell script that makes use of HFST's 
compose (instead of implementing a compose function in lttoolbox). 
We will adopt this approach if the prototype works as expected.

The shell script will work as follows:
1) lt-print will be used to convert the FST to AT&T format.
2) The weights will be estimated from the tagged corpus by counting the unigram 
lexical forms (a clever set of shell commands can do the job, but I am not an 
expert in shell scripting so it will take me some time; I am open to 
suggestions, sources and examples for doing so).
3) For each weighted string, hfst-strings2fst (or the corresponding regex 
version) will be used to generate simple FSTs.
4) The FSTs will be composed using hfst-compose.
5) The final FST will be converted to AT&T format.
6) lt-comp will be used to regenerate a weighted FST that is compatible with 
all the tools that rely on Apertium.

In this version, we will just use unigram counts of the lexical forms to 
estimate the weights.
Additionally, the weight will be assigned to the final state and won't be 
distributed among the edges (we will most probably want to change this later).
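
As a rough sketch of the unigram estimation, the snippet below counts lexical 
forms and turns the counts into weights. Two things are assumptions and not 
part of the plan above: that each token looks like ^surface/analysis$, and 
that the weight is the negative log of the relative frequency. The name 
unigram_weights is hypothetical.

```python
import math
import re
from collections import Counter

# Assumed token layout: ^surface/analysis$
TOKEN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def unigram_weights(tagged_text):
    """Map each lexical form to -log(count / total) (assumed weight formula)."""
    counts = Counter(analysis for _surface, analysis in TOKEN.findall(tagged_text))
    total = sum(counts.values())
    return {analysis: -math.log(count / total) for analysis, count in counts.items()}


# Made-up four-token corpus for illustration.
corpus = (
    "^The/the<det><def><sp>$ ^cat/cat<n><sg>$ "
    "^the/the<det><def><sp>$ ^dog/dog<n><sg>$"
)
weights = unigram_weights(corpus)
for analysis, weight in sorted(weights.items()):
    print(f"{analysis}\t{weight:.3f}")
```

More frequent lexical forms get smaller weights, so they are preferred when 
the lowest-weight analysis is selected.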

In parallel, I will try to extend the list of publications and ideas that 
could be used to weight automata in an unsupervised way.
I would be grateful if you could share resources or ideas regarding this 
part.

Finally, do you have recommendations for tagged corpora that can be used 
throughout the project for benchmarking?
I am using this English tagged corpus from the apertium-eng repository 
(https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged).
It would be better if we could benchmark on corpora and FSTs of different 
sizes and complexity.

Thanks and looking forward to hearing from you.
Your suggestions, feedback, feature requests are more than welcome.

Best Regards,
Amr

________________________________
From: Amr Mohamed Hosny Anwar
Sent: Sunday, May 19, 2019 12:50:52 AM
To: apertium-stuff
Cc: nlhow...@gmail.com
Subject: GSoC 19: Unsupervised weighting of automata - Implementing the 
supervised method of weighing automata


Dear maintainers, contributors,

Hope this email finds you well.

This mail can be considered as a status report for detailing next week's plan 
in addition to seeking feedback/ suggestions regarding the project.



Best Regards,
Amr Keleg

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff

test_data
Description: test_data

unweighted_results
Description: unweighted_results

weighted_results
Description: weighted_results
