Hi Hector,

Yes, these files are exactly what I need.
However, the *.tagged.txt files don't seem to be part of the upstream 
repository: https://github.com/apertium/apertium-oci/tree/master/texts

I am currently experimenting with the English and Italian tagged 
corpora/morphological analysers.
The more languages we have, the better we can compare weighting 
methodologies.
I don't have a strong background in linguistics, so it would help if you 
could recommend corpora from a diverse set of languages.

Thanks,
Amr

________________________________
From: Hèctor Alòs i Font <hectora...@gmail.com>
Sent: Tuesday, May 21, 2019 12:42:02 PM
To: [apertium-stuff]
Subject: Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - 
Implementing the supervised method of weighing automata

Hi Amr,

I'm not sure whether this will help you, but apertium-oci/texts contains 
several manually disambiguated texts in Occitan (approx. 14,000 words). They are:
atom_gascon.tagged.txt
continent.tagged.txt
glacier.tagged.txt
cors_aran.tagged.txt
hlama_coming.tagged.txt
uranus_prov.tagged.txt

Best,
Hector

Message from Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> on 
Sun, 19 May 2019 at 2:59:

Dear maintainers, contributors,

Hope this email finds you well.

This mail is a status report detailing next week's plan, and also a request 
for feedback/suggestions regarding the project.
After a fruitful discussion with my mentors Nick, Flammie and Francis, we have 
agreed on implementing the supervised way of weighing automata as follows:

The command will look like: lt-weight transducer.bin corpus.tagged

transducer.bin: An FST compiled using lttoolbox.
corpus.tagged: A tagged corpus that will be used to estimate the weights.

The weighting will be done by composing the main "unweighted" FST with a set of 
simple FSTs that are generated for each token.
A simplified example: if the main FST has an edge a:b::0 and the estimated 
weight for this edge is W, then the main FST will be composed with a simple FST 
consisting of an edge b:b::W, generating a new FST with an edge a:b::W.
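
A minimal sketch of the simplified example above, written in Python rather 
than with the actual hfst tools. It assumes that weights are added during 
composition (the tropical semiring); the Edge type, the compose_edges helper 
and the value of W are hypothetical and only serve the illustration.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    """One weighted transition, written in the mail as input:output::weight."""
    inp: str
    out: str
    weight: float


def compose_edges(first: Edge, second: Edge) -> Optional[Edge]:
    """Compose two single-edge transducers.

    The edges only match when the output of `first` equals the input of
    `second`. Weights are added, which assumes the tropical semiring.
    """
    if first.out != second.inp:
        return None
    return Edge(first.inp, second.out, first.weight + second.weight)


W = 2.5  # hypothetical estimated weight
main_edge = Edge("a", "b", 0.0)   # a:b::0 from the unweighted FST
weight_edge = Edge("b", "b", W)   # b:b::W from the simple weighting FST

print(compose_edges(main_edge, weight_edge))
```

Since the main edge carries weight 0, the composed edge a:b ends up with 
exactly the weight of the simple FST.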

To achieve this, I will create a new shell script that makes use of hfst's 
compose (instead of implementing/adding a compose function to lttoolbox). 
We will adopt this approach if the prototype proves to work as expected.

The shell script will work as follows:
1) lt-print will be used to convert the FST to AT&T format.
2) The weights will be estimated from the tagged corpus by counting the unigram 
lexical forms (a clever set of shell commands can do the job, but I am not an 
expert in shell scripting so it will take me some time - I am open to 
suggestions/sources/examples for doing so).
3) For each weighted string, hfst-strings2fst (or the corresponding regex 
version) will be used to generate simple FSTs.
4) The FSTs will be composed using hfst-compose.
5) The final FST will be converted to AT&T format.
6) lt-comp will be used to regenerate a weighted FST that is compatible with 
all the tools that rely on Apertium.
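
Step 2 could be sketched as below. This is Python rather than shell, and it 
rests on several assumptions: that each token in the tagged corpus looks like 
^surface/lexical form$ with a single analysis, and that a weight is the 
negative log of the relative frequency of the lexical form. Neither is fixed 
by the plan above, and the function names and the sample line are made up.

```python
import math
import re
from collections import Counter

# Assumed token shape: ^surface/lexical form$ with exactly one analysis.
# Tokens that still carry several analyses do not match and are skipped.
TOKEN = re.compile(r"\^([^/^$]*)/([^/^$]*)\$")


def count_lexical_forms(text):
    """Count how often each lexical form (lemma plus tags) occurs."""
    return Counter(match.group(2) for match in TOKEN.finditer(text))


def estimate_weights(counts):
    """Turn unigram counts into weights.

    Assumption: weight = -log(relative frequency), so that frequent
    analyses get smaller weights.
    """
    total = sum(counts.values())
    return {form: -math.log(n / total) for form, n in counts.items()}


# Made-up sample line, not taken from any real corpus.
sample = (
    "^the/the<det><def><sp>$ ^cats/cat<n><pl>$ "
    "^the/the<det><def><sp>$ ^cat/cat<n><sg>$"
)

counts = count_lexical_forms(sample)
weights = estimate_weights(counts)
for form, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{form}\t{weight:.4f}")
```

The tab-separated form/weight pairs are one possible input for step 3, where 
each weighted string is turned into a simple FST.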

In this version, we will just use unigram counts of the lexical forms to 
estimate the weights.
Additionally, the weight will be assigned to the final state and won't be 
distributed among the edges (we will most probably want to change this later).
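
To illustrate what "assigned to the final state" could look like in AT&T text 
format, here is a small Python sketch. It writes a one-path FST in which every 
transition has weight 0 and the whole weight sits on the final state. The 
path_to_att helper, the way the lexical form is split into symbols and the 
weight value are all hypothetical.

```python
def path_to_att(symbols, weight):
    """Return AT&T-format lines for a one-path identity FST over `symbols`.

    Transition lines are: source, target, input, output, weight.
    The final-state line is: state, weight.
    Every transition carries weight 0; the whole weight is on the final state.
    """
    lines = []
    for state, symbol in enumerate(symbols):
        lines.append(f"{state}\t{state + 1}\t{symbol}\t{symbol}\t0.0")
    lines.append(f"{len(symbols)}\t{weight}")
    return lines


# Hypothetical lexical form cat<n><sg>, already split into symbols.
print("\n".join(path_to_att(["c", "a", "t", "<n>", "<sg>"], 1.3863)))
```

Distributing the weight among the edges later would only change which lines 
carry a non-zero value, not the overall shape of the file.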

In parallel, I will try to improve the list of publications/ideas that 
will be used to weigh automata in an unsupervised way.
I would be grateful if you could share resources/ideas regarding this part 
with me.

Finally, do you have recommendations for tagged corpora that can be used 
throughout the project for benchmarking?
I am using this English tagged corpus from the apertium-eng repository 
(https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged).
It would be better if we could benchmark on corpora and FSTs of different 
sizes and complexity.

Thanks and looking forward to hearing from you.
Your suggestions, feedback, feature requests are more than welcome.

Best Regards,
Amr

________________________________
From: Amr Mohamed Hosny Anwar
Sent: Sunday, May 19, 2019 12:50:52 AM
To: apertium-stuff
Cc: nlhow...@gmail.com
Subject: GSoC 19: Unsupervised weighting of automata - Implementing the 
supervised method of weighing automata


Dear maintainers, contributors,

Hope this email finds you well.

This mail is a status report detailing next week's plan, and also a request 
for feedback/suggestions regarding the project.



Best Regards,
Amr Keleg

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff