Hi,

On Wed, Sep 12, 2012, at 15:49, Francis Tyers wrote:
> On Wed, 12 Sep 2012 at 15:35 +0200, Per Tunedal wrote:
> > Hi again,
> > unfortunately there are 50 000 lines with errors :-(
> 
> I've seen worse ;)
> 
> > Apparently, it's infeasible to correct them all by hand. 
> 
> No it isn't. :)
> 
> > Most errors are
> > of the @ type, word not in the bidix (and probably not in the Swedish
> > monodix either).
> 
> $ cat /tmp/da-sv.testvoc | grep '@' | wc -l
> 43662
> 
> $ cat /tmp/da-sv.testvoc | grep '@' | grep '<n>' | wc -l
> 23618
> 
> Divide the number of nouns by 8 (each noun has ~8 forms -- sg/pl
> def/indef nom/gen) and you get 2952.25 ... it's possible to translate
> 300-500 words/day, making it about a week's work to add the
> translations.

If someone made a simple GUI where you could edit the dictionaries you
might translate 1000 words a day! Or maybe 2000?

> 
> $ cat /tmp/da-sv.testvoc | grep '@' | grep '<vblex' | wc -l
> 19774
> 
> There are 9 forms per verb, making it around 2197 translations, another
> week or so. 
> 
> Then the remaining ~270 errors shouldn't take more than a couple of days
> to fix.
> 
> So, it's about half a month's work in total.
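
The estimate above can be reproduced with a few lines of shell. This is
only a sketch: the counts and the forms-per-lemma figures (8 for nouns,
9 for verbs) are taken from the message, while the function and variable
names are made up for illustration.

```shell
#!/bin/sh
# Rough workload estimate from the testvoc counts quoted above.
# estimate_lemmas is a hypothetical helper, not part of any Apertium tool.

estimate_lemmas() {
    # $1 = number of error lines, $2 = approximate forms per lemma
    # Integer division, so the fractional part is dropped.
    echo $(( $1 / $2 ))
}

total=43662   # all lines containing '@'
nouns=23618   # '@' lines tagged <n>
verbs=19774   # '@' lines tagged <vblex

noun_lemmas=$(estimate_lemmas "$nouns" 8)
verb_lemmas=$(estimate_lemmas "$verbs" 9)
rest=$(( total - nouns - verbs ))

echo "noun lemmas to translate: $noun_lemmas"
echo "verb lemmas to translate: $verb_lemmas"
echo "remaining error lines:    $rest"
```

Dividing by a fixed number of forms is only an approximation; counting the
unique lemmas in the testvoc output would give a more exact figure.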

Well, as Danish isn't my main interest, it'll have to wait.

> 
> > I'm thinking about some automatic solution.
> > 1) adding the Swedish words from SALDO (and words from is-sv).
> 
> This is a bad idea -- you'll mess up the testvoc in the other direction.
> You should first work out the translations, then only add the ones you
> have translations for.

What about some tool to automatically strip the monodix of the offending
words?
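
A first step toward such a tool might be to list the offending lemmas
once each. The sketch below assumes that every failing testvoc line
contains the unknown word as "@lemma<tag>..."; I have not checked that
against the real output format, and missing_lemmas is just a name I made
up. It uses grep -o, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# List the unique lemmas that testvoc marked with '@' for one part of speech.
# ASSUMPTION: the unknown word appears as "@lemma<tag>..." on the line.
# missing_lemmas is a hypothetical helper, not an existing Apertium command.

missing_lemmas() {
    # $1 = testvoc output file, $2 = first tag after the lemma (e.g. n, vblex)
    grep -o "@[^<@\$ ]*<$2>" "$1" |   # keep only the "@lemma<tag>" fragments
        sed -e 's/^@//' -e "s/<$2>\$//" |   # strip the '@' and the tag
        sort -u                       # one line per lemma
}

# Example (file name taken from the commands quoted earlier):
#   missing_lemmas /tmp/da-sv.testvoc n > nouns-todo.txt
```

The resulting list could serve either as a to-do list for translating or
as input for removing the entries from the monodix.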

> 
> > 2) building a bidix from bilingual corpora
> 
> This would be ok, but in my experience, setting up Moses/GIZA/etc. for
> the first time can take up to a week. And in the end you have to
> postedit (check every word) in the lists anyway.

It would be much faster, as someone has already pointed out. I might
try this in the future, as I will install Moses anyway for the pair
Swedish (sv) - French (fr).
> 
> > What's the most appropriate way to continue?
> 
> I would just do it manually, it's really not that much work when you
> break it down into manageable chunks.

Well, but it's kind of silly work, copying text lines around and
changing them to fit your needs. I simply don't like it. And it's easy to
forget to change something and introduce unnecessary errors. (BTW, I once
got a letter from the church with an invoice for a tomb attached. The
wording made me think that my grandfather had died. I was
shocked, but within an hour I found out that he was all right. The church
employee had simply made a mistake: she had reused an old Word document
as a draft for the letter. You can just imagine!)

> 
> Fran
> 
> 
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and 
> threat landscape has changed and how IT managers can respond. Discussions 
> will include endpoint security, mobile security and the latest in malware 
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff

