Hi,

On Wed, Sep 12, 2012, at 15:49, Francis Tyers wrote:
> On Wed, 12 Sep 2012 at 15:35 +0200, Per Tunedal wrote:
> > Hi again,
> > unfortunately there are 50 000 lines with errors :-(
>
> I've seen worse ;)
>
> > Apparently, it's infeasible to correct them all by hand.
>
> No it isn't. :)
>
> > Most errors are
> > of the @ type, word not in the bidix (and probably not in the Swedish
> > monodix either).
>
> $ cat /tmp/da-sv.testvoc | grep '@' | wc -l
> 43662
>
> $ cat /tmp/da-sv.testvoc | grep '@' | grep '<n>' | wc -l
> 23618
>
> Divide the number of nouns by 8 (each noun has ~8 forms -- sg/pl
> def/indef nom/gen) and you get 2952.25 ... it's possible to translate
> between 300-500 words/day, making it about a week's work to add the
> translations.
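The estimate above can be redone for any testvoc file with a few lines of Python. This is only a rough sketch: the helper names are made up for illustration, and it assumes one form per line and that a plain substring match (as with grep) is enough to pick out the error type and the part of speech.

```python
def count_matching(lines, *needles):
    """Count lines that contain every needle, like chaining `grep a | grep b | wc -l`."""
    return sum(1 for line in lines if all(n in line for n in needles))


def estimate_lemmas(error_forms, forms_per_lemma):
    """Rough number of missing lemmas, given an assumed average of forms per lemma."""
    return error_forms / forms_per_lemma


def estimate_days(lemmas, words_per_day):
    """Rough number of working days at an assumed translation rate."""
    return lemmas / words_per_day


if __name__ == "__main__":
    # Figures quoted in this thread; 8 and 9 forms per lemma are the quoted averages.
    nouns = estimate_lemmas(23618, 8)
    verbs = estimate_lemmas(19774, 9)
    print(f"nouns: {nouns:.2f} lemmas, verbs: {verbs:.2f} lemmas")
    # Days at the quoted pace of 300 to 500 words per day.
    for rate in (300, 500):
        print(f"at {rate}/day: {estimate_days(nouns + verbs, rate):.1f} days")
```

To count in a real file, something like `count_matching(open("/tmp/da-sv.testvoc"), "@", "<n>")` should give the same number as the second grep pipeline, under the assumptions above.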

If someone made a simple GUI where you could edit the dictionaries, you
might translate 1000 words a day! Or maybe 2000?

>
> $ cat /tmp/da-sv.testvoc | grep '@' | grep '<vblex' | wc -l
> 19774
>
> There are 9 forms per verb, making it around 2197 translations, another
> week or so.
>
> Then the remaining ~270 errors shouldn't take more than a couple of days
> to fix.
>
> So, it's about half a month's work in total.

Well, as Danish isn't my main interest, it'll have to wait.

>
> > I'm thinking about some automatic solution.
> > 1) adding the Swedish words from SALDO (and words from is-sv).
>
> This is a bad idea -- you'll mess up the testvoc in the other direction.
> You should first work out the translations, then only add the ones you
> have translations for.

What about some tool to automatically strip the monodix of the offending
words?

>
> > 2) building a bidix from bilingual corpora
>
> This would be ok, but in my experience, setting up Moses/GIZA/etc. for
> the first time can take up to a week. And in the end you have to
> postedit (check every word) in the lists anyway.

It would still be very much faster, as someone has already pointed out. I
might try this in the future, as I will install Moses anyway for the pair
Swedish (sv) - French (fr).

>
> > What's the most appropriate way to continue?
>
> I would just do it manually, it's really not that much work when you
> break it down into manageable chunks.

Well, but it's rather silly work, copying text lines around and changing
them to fit your needs. I simply don't like it. And it's easy to forget to
change something and introduce unnecessary errors.

(BTW, I once got a letter from the church with an invoice for a tomb
attached. The wording was such that I thought my grandfather had died. I
was shocked, but within an hour I found out that he was all right. The
church employee had simply made a mistake: she had used an old Word
document as a draft for the letter. You can just imagine!)

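Coming back to the idea of a tool that strips the monodix of the offending words, here is a minimal sketch of what I have in mind. It is not an existing Apertium tool, and the function names are made up. It assumes that the missing words show up in the testvoc output as `@lemma<tag>...` and that monodix entries look like `<e lm="lemma">` inside a `<section>`. Note that ElementTree drops comments and changes the formatting, so the result would have to be checked before replacing a real dictionary.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of a missing-bidix marker in the testvoc output: @lemma<tag>
MISSING = re.compile(r"@([^<>@^$/]+)<")


def missing_lemmas(testvoc_lines):
    """Collect the lemmas that appear after an '@' marker in the given lines."""
    found = set()
    for line in testvoc_lines:
        found.update(m.strip() for m in MISSING.findall(line))
    return found


def strip_entries(dix_xml, lemmas):
    """Remove <e lm="..."> entries whose lemma is in `lemmas`.

    Takes the dictionary as an XML string and returns the new XML string
    together with the number of entries removed.
    """
    root = ET.fromstring(dix_xml)
    removed = 0
    for section in root.iter("section"):
        # Copy the list first, since we remove children while looping.
        for entry in list(section.findall("e")):
            if entry.get("lm") in lemmas:
                section.remove(entry)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

One would call `missing_lemmas()` on the lines of the testvoc file and pass the result to `strip_entries()` along with the contents of the monodix. Matching on the lemma alone is crude: two entries with the same lemma but different parts of speech would both be removed.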
>
> Fran

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
