Hi all,

I would like to share the status of the script that implements the supervised method of weighting automata.
First, I used this page to understand the Xerox regex syntax, "Syntax of Regular Expressions (Finite-State Calculus)": ftp://ftp.cis.upenn.edu/pub/cis639/public_html/docs/fssyntax.html

Then, I wrote a bash script (URL: https://github.com/AMR-KELEG/lttoolbox/tree/supervised-weighing-of-automata-experiment/scripts) whose inputs are:
- A tagged corpus (such as: https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged)
- A compiled FST in apertium's format

Currently, the script isn't working, as lt-comp fails to compile FSTs with initial epsilon transitions or multiple FSTs in the same .att file. I haven't inspected the source of the error yet. The thrown exception is:

    terminate called after throwing an instance of 'std::invalid_argument'
      what(): stoi
    Aborted (core dumped)

I tried adding [?*] at the start and the end of the regex, but the script became much, much slower. I left it for about two hours and it was still doing computations! I will have to estimate the complexity of the compose function to predict whether the script's running time is reasonable or not.

In order to keep all the paths that aren't part of the tagged corpus, I added a regex of the form [?*]::1000000. So basically, all the paths that are part of the input FST will have a very large weight (we should represent INF in a better way). However, this creates the problem of having two analyses for anything matched by one of the weighted regexps. I will experiment with subtracting FSTs so that weighted analyses don't also receive the trivial unweighted analysis.

I have tested the script using the apertium-eng FST and I am attaching the results of using the unweighted/weighted HFST FST.

Should I investigate the causes of the errors related to lttoolbox and maybe implement methods that will help in compiling hfst-generated att files? I believe we will need to figure out a way to weight FSTs regardless of the way these weights were generated.
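To make the duplicate-analysis problem described above concrete, here is a minimal Python sketch that models the composition on plain sets of analysis strings rather than on real FSTs. The function names, the example analyses and the corpus weight are all hypothetical; only the 1000000 fallback weight comes from the message. It illustrates the idea and is not what the script does internally.

```python
FALLBACK_WEIGHT = 1_000_000.0  # the large weight used in [?*]::1000000


def weight_with_fallback(analyses, corpus_weights):
    """Model composing with (weighted strings | [?*]::FALLBACK_WEIGHT).

    Every analysis matches [?*], so every analysis gets a fallback copy.
    Analyses seen in the corpus additionally get a corpus-weighted copy.
    """
    weighted = []
    for analysis in analyses:
        if analysis in corpus_weights:
            weighted.append((analysis, corpus_weights[analysis]))
        weighted.append((analysis, FALLBACK_WEIGHT))
    return weighted


def weight_with_subtraction(analyses, corpus_weights):
    """Model restricting the fallback to ([?*] minus the seen strings).

    Each analysis now receives exactly one weight.
    """
    return [(a, corpus_weights.get(a, FALLBACK_WEIGHT)) for a in analyses]


# Hypothetical analyses of one surface form and one made-up corpus weight.
analyses = ["walk<n><sg>", "walk<vblex><inf>"]
corpus_weights = {"walk<n><sg>": 1.2}

with_fallback = weight_with_fallback(analyses, corpus_weights)
with_subtraction = weight_with_subtraction(analyses, corpus_weights)

print(with_fallback)     # the seen analysis appears twice
print(with_subtraction)  # one weight per analysis
```

In this toy model the seen analysis shows up twice under the plain fallback and only once after the subtraction, which is the behaviour the planned FST subtraction is meant to achieve.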
Regards,
Amr

________________________________
From: Hèctor Alòs i Font <hectora...@gmail.com>
Sent: Wednesday, May 22, 2019 5:55 AM
To: [apertium-stuff]
Subject: Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - Implementing the supervised method of weighing autoamata

Hi Amr,

The files should be there. It was a mistake. I added them. Each line has a word. If something is not clear, the "raw" files have the original texts without morphological analysis.

About other manually tagged corpora, I cannot help. This is the only one we did in the projects I've been working on. Unfortunately, we didn't find the time to create a similar one for Sardinian two years ago.

Best,
Hèctor

Message from Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> on Tue, May 21, 2019 at 19:05:

Hi Hector,

Yes, these files are for sure what I need. However, it seems like these files (*.tagged.txt) aren't part of the upstream repository: https://github.com/apertium/apertium-oci/tree/master/texts

I am currently experimenting with the English and Italian tagged corpora/morphological analysers. The more languages we have, the better we can compare weighting methodologies. I don't have a strong background in linguistics, so I thought it would be better if you could recommend corpora from diverse languages.

Thanks,
Amr

________________________________
From: Hèctor Alòs i Font <hectora...@gmail.com>
Sent: Tuesday, May 21, 2019 12:42:02 PM
To: [apertium-stuff]
Subject: Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - Implementing the supervised method of weighing autoamata

Hi Amr,

I'm not sure whether it will help you, but in apertium-oci/texts there are several manually disambiguated texts in Occitan. Approx. 14,000 words.
They are:
atom_gascon.tagged.txt
continent.tagged.txt
glacier.tagged.txt
cors_aran.tagged.txt
hlama_coming.tagged.txt
uranus_prov.tagged.txt

Best,
Hector

Message from Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> on Sun, May 19, 2019 at 2:59:

Dear maintainers, contributors,

Hope this email finds you well.
This mail can be considered a status report detailing next week's plan, in addition to seeking feedback/suggestions regarding the project.

After a fruitful discussion with my mentors Nick, Flammie and Francis, we have agreed on implementing the supervised way of weighing automata as follows.

The command will look like:
lt-weight transducer.bin corpus.tagged

transducer.bin: An FST compiled using lttoolbox.
corpus.tagged: A tagged corpus that will be used to estimate the weights.

The weighting will be done by composing the main "unweighted" FST with a set of simple FSTs that are generated for each token.
A simplified example: if the main FST has an edge a:b::0 and the estimated weight for this edge is W, then the main FST will be composed with a simple FST consisting of an edge b:b::W, generating a new FST with an edge a:b::W.

To achieve this, I will create a new shell script that makes use of hfst's compose (instead of implementing/adding a compose function to lttoolbox). We will approve and use this approach if the prototype proves to function as expected.

The shell script will work as follows:
1) lt-print will be used to convert the FST to AT&T format.
2) The weights will be estimated from the tagged corpus by counting the unigram lexical forms (A clever set of shell commands can do the job, but I am not an expert in shell scripting so it will take me some time - I am open to suggestions/sources/examples for doing so).
3) For each weighted string, hfst-str2fst (or the corresponding regex version) will be used to generate simple FSTs.
4) The FSTs will be composed using hfst-compose.
5) The final FST will be converted to AT&T format.
6) lt-comp will be used to regenerate a weighted FST that is compatible with all the tools that rely on apertium.

In this version, we will just use unigram counts for the lexical forms to estimate the weights. Additionally, the weight will be assigned to the final state and won't be distributed among the edges (we will most probably want to change this later).

On the other hand, I will try to improve the list of publications/ideas that will be used to weigh automata in an unsupervised way. I would be grateful if you could share resources/ideas regarding this part with me.

Finally, do you have recommendations for tagged corpora that can be used throughout the project for benchmarking? I am using this English tagged corpus from the apertium-eng repository (https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged). It would be better if we could do benchmarking on corpora and FSTs of different sizes and complexity.

Thanks and looking forward to hearing from you.
Your suggestions, feedback and feature requests are more than welcome.

Best Regards,
Amr

________________________________
From: Amr Mohamed Hosny Anwar
Sent: Sunday, May 19, 2019 12:50:52 AM
To: apertium-stuff
Cc: nlhow...@gmail.com
Subject: GSoC 19: Unsupervised weighting of automata - Implementing the supervised method of weighing autoamata

Dear maintainers, contributors,

Hope this email finds you well.
This mail can be considered a status report detailing next week's plan, in addition to seeking feedback/suggestions regarding the project.
Best Regards,
Amr Keleg

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
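Step 2 of the plan quoted above (estimating weights by counting unigram lexical forms) could be sketched in Python roughly as follows. This is only an illustration under two assumptions that are not stated in the thread: that each disambiguated token looks like ^surface/analysis$, and that a count is turned into a weight as -log(count/total) so that frequent analyses get smaller weights. The function name and the two-line sample corpus are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed token shape: ^surface/analysis$ with exactly one analysis per token.
TOKEN = re.compile(r"\^([^/^$]*)/([^/^$]*)\$")


def estimate_weights(lines):
    """Return {analysis: -log(count / total)} from unigram counts."""
    counts = Counter(
        analysis
        for line in lines
        for _surface, analysis in TOKEN.findall(line)
    )
    total = sum(counts.values())
    return {analysis: -math.log(count / total) for analysis, count in counts.items()}


# Hypothetical two-line sample standing in for a tagged corpus file.
corpus = [
    "^The/the<det><def><sp>$ ^cats/cat<n><pl>$",
    "^the/the<det><def><sp>$ ^cat/cat<n><sg>$",
]

weights = estimate_weights(corpus)
for analysis, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{analysis}\t{weight:.4f}")
```

Each analysis/weight pair printed here would then be the input for the string-to-FST step (step 3). Analyses that never occur in the corpus get no entry at all, which is why some fallback weight is still needed for them.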
Attachments:
- test_data
- unweighted_results
- weighted_results