Per Tunedal <[email protected]> writes:

> Hi,
> I've successfully extracted a Swedish word list from
> apertium.sv-da.sv.dix as follows:
>
> lt-expand apertium-sv-da.sv.dix | cut -f1 -d':' >
> apertium-sv-da.sv.dix.expanded
>
> Going through the list I found lots of errors. I excluded words present
> in the Aspell dictionary to get a shorter list of misspelled words. It
> was quite long though, and worse: it contained mostly correctly spelled
> words, unknown to Aspell. Hunspell (used by e.g. OpenOffice/Libre
> Office) knows much more words. Anyone that happens to know how to
> extract/get Hunspell word lists as text files?
>
> Looking at the misspelled list I realised that many of "the errors" are
> variants added for analysis only (r="LR"). Is there an easy way to
> expand only the variants that are used for generation? Such a procedure
> would produce a much shorter and more correct list.

LR entries are output from lt-expand with :>: as the field separator, so
you can do

    lt-expand *.sv.dix | grep -v ':>:' | cut -f1 -d: > sv.expanded

You might also want to exclude RL-marked entries (they tend to be a bit
weird in monodixes):

    lt-expand *.sv.dix | grep -v ':[<>]:' | cut -f1 -d: > sv.expanded

> Anyhow, I continued by checking the list in Word-processing programs to
> get the real errors and found quite a lot. Some of them have I already
> corrected in the pair sv-da. What about the separate language
> dictionary? Should I merge my corrections somehow? What's the
> recommended procedure when improving/adding to an existing language
> pair?

It'd be great if you could merge your changes in there; before your
changes the diff was only 32 lines long so I don't think it should be
much work (you might even be able to just copy it over).

> By the way: How do I use the separated language monodixies? Can they be
> used for existing pairs or only when creating new pairs? What's the
> recommendation for new pairs? The "Apertium New Language Pair HOWTO"
> still supposes that the monodixies are made exclusively for the new
> pair.

The challenge is just getting the monodixes merged; if you merge in
those changes, we can make apertium-sv-da depend on
languages/apertium-swe with a little change to the makefiles.

(The diff for the Danish side is 67736 lines long, so that may be more
of a challenge to merge … but I'd still say it's worth it to merge the
Swedish side right away.)

-- 
Kevin Brubeck Unhammer

GPG: 0x766AC60C
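
For reference, the filtering suggested in the reply above could be wrapped in a small shell function, roughly as sketched here. The function names `surface_forms` and `sample_expanded` and the sample entries are made up for illustration and are not taken from the real dictionary; the `sort -u` step is an addition that only removes duplicate surface forms.

```shell
# Read lt-expand output on stdin and print the unique surface forms of
# entries that carry no direction restriction. Lines using ':>:' (LR) or
# ':<:' (RL) as the field separator are dropped.
surface_forms() {
    grep -v ':[<>]:' | cut -f1 -d: | sort -u
}

# Hypothetical stand-in for lt-expand output, so the sketch can be tried
# without a dictionary. The entries are invented for illustration.
sample_expanded() {
    printf '%s\n' \
        'katt:katt<n><ut><sg><ind><nom>' \
        'katten:katt<n><ut><sg><def><nom>' \
        'kat:>:katt<n><ut><sg><ind><nom>' \
        'kattan:<:katt<n><ut><sg><def><nom>'
}

sample_expanded | surface_forms
```

With a real monodix the same function would presumably be used as `lt-expand apertium-sv-da.sv.dix | surface_forms > sv.expanded`.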
_______________________________________________ Apertium-stuff mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/apertium-stuff
