On Sun, 11 Nov 2012 at 15:52 +0100, Per Tunedal wrote:
> Hi,
>
> On Sun, Nov 11, 2012, at 14:27, Francis Tyers wrote:
> > On Sun, 11 Nov 2012 at 14:11 +0100, Per Tunedal wrote:
> > > Hi,
> > > >
> > > > Further:
> > > >
> > > > > I am reflecting on the best way of treating prefixes, used to
> > > > > change the meaning of a word. First I thought of attacking it as
> > > > > a compound, but I'm not sure that's the best way. Maybe something
> > > > > like your example would be better? Or even a third solution?
> > > >
> > > > Don't do it. Work on stuff that is really going to affect the
> > > > quality of the translation.
> > >
> > > Well, the most blatant errors are:
> > > 1. Low word coverage. And I just wanted to try a solution that
> > > quickly increases the coverage. Then there wouldn't be any panic
> > > about adding more words, but it would increase the translation
> > > quality (and speed) one step further. It would be a pleasure, not a
> > > plight, to add new words.
> >
> > There is no solution that quickly improves the coverage without
> > quickly adding words. If you can't manage adding a few words, then I
> > think that MT is not for you.
>
> Well, I can always fall back on statistical MT, couldn't I? All the
> same, I would like to try out Apertium. Rule-based translation is
> interesting: I learn more about languages at the same time as I learn
> about Apertium.

Yes, that's what people who don't like learning about languages and
working with dictionaries do. They use SMT. Me, I prefer learning about
languages :) It's fun!!

> > > 2. Strange errors, probably due to mistakes made by the tagger. And
> > > you've told me that it isn't any use to train the tagger before
> > > adding some 20,000 words. That would take me some 20 years. It's
> > > simply out of the question.
> >
> > If you think adding 20,000 words would take 20 years then you must be
> > a very slow worker. For me, it would take about two months full time,
> > or six months part time. Perhaps a year, working for an hour a day.
> > Are you really saying that you are more than 20--40 times slower than
> > me? I mean, it's a fairly simple task; I find it hard to believe that
> > there could be such a huge difference in productivity.
> >
> > Try to measure your productivity over an hour -- or half an hour. And
> > tell us how much it is, and how you've been working -- how you
> > approach the task. It could be that you are just working really
> > inefficiently and we can help you get up to normal speed.
>
> Well, the largest problem is that I have a very limited knowledge of
> Danish and not many resources available.

Translate from Danish to Swedish. Use the Europarl parallel corpus. You
can quite easily take a frequency list of missing words, and build a
concordance for each word using the corpus.
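For example, something along these lines (a rough sketch; the mode name,
file names and example word are just placeholders, and it assumes the
default behaviour where apertium prefixes unknown words with '*'):

  # frequency list of words the pair doesn't know yet
  apertium -d . da-sv < corpus.da \
    | tr ' ' '\n' | grep '^\*' | sed 's/^\*//; s/[.,;:?!]*$//' \
    | sort | uniq -c | sort -nr > missing.txt

  # quick concordance for one of the missing words
  grep -i 'huset' corpus.da | head -20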
> My main goal is to translate Norwegian: I have by now acquired some
> interesting books and done a short course in Norwegian at the
> university.

Great.

> The second problem is that I hate editing XML files, as it's so easy to
> make mistakes. And I have to learn a lot of codes/tags that I'm not
> really interested in. But I will manage. I have printed the Apertium
> manual and will read it. I hope it will help.

Very few people actually write the XML from scratch. Normally what I do
is make some kind of spreadsheet-style list,

  a, b, adj, adj.sint
  c, d, n.f, n.m
  e, f, vblex, vblex

and then use a simple bash or python script to convert it to XML:

  for w in `cat list | sed 's/ /_/g'`; do
    row=`echo $w | sed 's/_//g'`
    sl=`echo $row | cut -f1 -d','`
    tl=`echo $row | cut -f2 -d','`
    st=`echo $row | cut -f3 -d',' | sed 's/\./"\/><s n="/g'`
    tt=`echo $row | cut -f4 -d',' | sed 's/\./"\/><s n="/g'`
    echo '<e><p><l>'$sl'<s n="'$st'"/></l><r>'$tl'<s n="'$tt'"/></r></p></e>'
  done

> > > > Work from frequency and add them a word at a time. Do not try to
> > > > work with derivational morphology while the coverage is so low.
> > >
> > > As I've said: why not? What's the drawback?
> >
> > The drawback is that it is unpredictable, and you end up with crappy
> > dictionaries. Even the current compounding mechanism, between two
> > languages like Dutch and Afrikaans, is only around 90% accurate. And
> > that is for noun-noun compounds, which are the most predictable. If
> > you start to add derivation, you will decrease accuracy, probably to
> > the point where it causes more problems than it solves.
>
> OK. Thank you for explaining. As you've probably noted, I always ask
> "why". I never take anything for granted. That way I learn a lot and
> avoid doing stupid things: just because everyone has always done things
> in some way, it doesn't mean that it's the best way, nor that it's the
> way that suits me best.
>
> I plan to do some more improvements to the pair Swedish-Danish (se-da)

sv! :)

> and then start working with Norwegian - Swedish (no-sv).

Great!

Fran
