On Wed, 12 Sep 2012 at 15:35 +0200, Per Tunedal wrote:
> Hi again,
> unfortunately there are 50 000 lines with errors :-(
I've seen worse ;)

> Apparently, it's infeasible to correct them all by hand.

No it isn't. :)

> Most errors are
> of the @ type, word not in the bidix (and probably not in the Swedish
> monodix either).

$ cat /tmp/da-sv.testvoc | grep '@' | wc -l
43662

$ cat /tmp/da-sv.testvoc | grep '@' | grep '<n>' | wc -l
23618

Divide the number of noun errors by 8 (each noun has ~8 forms -- sg/pl
def/indef nom/gen) and you get 2952.25 ... it's possible to translate
between 300-500 words/day, making it about a week's work to add the
translations.

$ cat /tmp/da-sv.testvoc | grep '@' | grep '<vblex' | wc -l
19774

There are 9 forms per verb, making it around 2197 translations, another
week or so. Then the remaining ~270 errors shouldn't take more than a
couple of days to fix. So, it's about half a month's work in total.

> I'm thinking about some automatic solution.
> 1) adding the Swedish words from SALDO (and words from is-sv).

This is a bad idea -- you'll mess up the testvoc in the other direction.
You should first work out the translations, then only add the words you
have translations for.

> 2) building a bidix from bilingual corpora

This would be OK, but in my experience, setting up Moses/GIZA/etc. for
the first time can take up to a week. And in the end you have to
post-edit (check every word in) the lists anyway.

> What's the most appropriate way to continue?

I would just do it manually; it's really not that much work when you
break it down into manageable chunks.

Fran
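
The estimate above can be reproduced with a short script. This is only a sketch, not an Apertium tool: the helper name `estimate_days` is hypothetical, the forms-per-lemma figures (8 for nouns, 9 for verbs) and the 300-500 words/day rate are the ones quoted in the message, and the input is assumed to be testvoc output with one error per line.

```shell
#!/bin/sh
# Rough workload estimate from testvoc output (a sketch under the
# assumptions stated above; estimate_days is a hypothetical helper).
# usage: estimate_days FILE TAG FORMS_PER_LEMMA
estimate_days() {
    file=$1; tag=$2; forms=$3
    # Count '@' lines (word missing from the bidix) that carry the tag.
    n=$(grep '@' "$file" | grep -c -- "$tag")
    # Lemmas ~ error lines / forms per lemma; days at 500 and 300 words/day.
    awk -v n="$n" -v f="$forms" -v t="$tag" 'BEGIN {
        lemmas = n / f
        printf "%s: %d lines, ~%d lemmas, %.1f-%.1f days\n",
               t, n, lemmas, lemmas / 500, lemmas / 300
    }'
}

# Only run against the file from the message if it actually exists.
if [ -f /tmp/da-sv.testvoc ]; then
    estimate_days /tmp/da-sv.testvoc '<n>' 8
    estimate_days /tmp/da-sv.testvoc '<vblex' 9
fi
```

The day range is printed fastest-first (500 words/day, then 300), so the two numbers bracket the "about a week" figure for each part of speech.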
_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
