On Wed, 08 Feb 2012 at 20:26 -0500, Dan Loehr wrote:
> Many thanks, Fran.  I won't be able to download and test the new
> version (apertium-mk-en-0.1.1.tar.gz) for a day or two.  But I did
> want to reply right away and say thank you.
>  
> You also asked for feedback on the quality.  You are probably already
> aware that it does very well compared to Google Translate.  Your
> online platform at apertium.org provides this translation of a section
> from the Macedonian version of the UN Declaration of Human Rights:
>  
> Since the recognition on врoдeнoтo dignity, and on the equal and
> нeoтуѓиви authentic on all members on the humanity are тeмeлитe on the
> freedom, the justice and the peace in the world; 
> 
> And here's Google Translate's translation of the same passage:
>  
> A great priznavanjeto Following the vrodenoto dostoinstvo, also in
> case of ednakvite and neotugjivi prava Following the all outdoor
> chlenovi Following the choveshtvoto everything temelite Following the
> slobodata, pravdata and mirot vo svetot;
> 
> Here's the UN's English version (available at
> http://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=eng)
>  
> Whereas recognition of the inherent dignity and of the equal and
> inalienable rights of all members of the human family is the
> foundation of freedom, justice and peace in the world, 
>  
> And here's the actual section that was translated (available at
> http://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=mkj):
>  
> Бидejќи признaвaњeтo нa врoдeнoтo дoстoинствo, и нa eднaквитe и
> нeoтуѓиви прaвa нa ситe члeнoви нa чoвeштвoтo сe тeмeлитe нa
> слoбoдaтa, прaвдaтa и мирoт вo свeтoт;
>  
> So for 8-10 days' work, I'd say you've done quite well!
>  
> Thanks again,

Hmm, the poor result from Google is surprising and leads me to think
there is something else at play here. I'm sure they have the same corpus
I was working with (SETimes), and I would be surprised if they hadn't
used the UDHR in their training corpus too.

I just checked, and the Macedonian input (from the UDHR) is full of Latin
characters: e.g. Latin 'o' instead of Cyrillic 'о', and likewise for 'e'
and 'a'.
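
A check of this kind can be sketched in a few lines of Python. This is
only an illustration, not the tool I used: the function name
latin_letters_in is made up, and it relies on the standard unicodedata
module to tell Latin letters apart from Cyrillic ones by their Unicode
names.

```python
import unicodedata

def latin_letters_in(text):
    """Return the Latin letters found in text, in order of appearance."""
    return [ch for ch in text
            if ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")]

# Escapes are used so the two scripts cannot be confused by eye:
# "\u043e" is Cyrillic 'о', while the plain "o" characters are Latin.
sample = "\u0432\u0440o\u0434\u0435\u043do\u0442\u043e"
print(latin_letters_in(sample))
```

Running this over a Cyrillic-only text should give an empty list, so any
non-empty result points at input worth cleaning.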

If we replace them with their Cyrillic counterparts, Google gets a much
better result:

--

Бидеjќи признавањето на вроденото достоинство, и на еднаквите и
неотуѓиви права на сите членови на човештвото се темелите на слободата,
правдата и мирот во светот;

Since they recognizing the inherent dignity and equal and inalienable
rights of all members of the human family is the foundation of freedom,
justice and peace in the world;

--

So, if you want a free/rule-based system then Apertium is probably what
you're looking for. And we'd definitely welcome further feedback and
development. Otherwise, if you want to make a vanilla SMT system, use
the SETimes corpus and make sure you sanitise your input on the
Macedonian side for unexpected Latin characters (in Apertium we have an
option to do it in the dictionary compilation stage).
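
For the SMT route, the sanitising step could look roughly like the
Python sketch below. It is not how Apertium does it; the names
HOMOGLYPHS and sanitise are made up for the example, and the table only
covers the three letters mentioned above ('a', 'e', 'o'). Restricting
the replacement to words that already contain a Cyrillic letter is my
own assumption, so that genuine Latin-script words are left alone.

```python
import re

# Latin look-alikes named in this message, mapped to Cyrillic.
# Not a complete list: extend it for other homoglyphs as needed.
HOMOGLYPHS = str.maketrans({
    "a": "\u0430",  # Cyrillic а
    "e": "\u0435",  # Cyrillic е
    "o": "\u043e",  # Cyrillic о
})

CYRILLIC = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic block
WORD = re.compile(r"\w+")

def sanitise(text):
    """Replace Latin look-alikes, but only in words that also contain Cyrillic."""
    def fix(match):
        word = match.group(0)
        return word.translate(HOMOGLYPHS) if CYRILLIC.search(word) else word
    return WORD.sub(fix, text)
```

Applied line by line to the Macedonian side of the corpus before
training, this would turn a mixed-script word such as "нa" (Cyrillic н,
Latin a) into an all-Cyrillic one while leaving a word like "Google"
untouched.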

Best regards,

Fran

PS. I'm really surprised Google isn't doing this for languages written
in Cyrillic. Stray Latin characters don't just turn up in Macedonian
(sometimes from bad keyboard layouts, sometimes from bad OCR software),
but also in other languages with Cyrillic-based scripts, such as Chuvash
and Komi.


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
