Re: [Apertium-stuff] paper on combining rules + statistics in POS tagging

Francis Tyers Sun, 17 Jun 2012 06:32:02 -0700

On Sat, 16 Jun 2012 at 16:48 +0200, Mikel Forcada wrote:
> Thanks Fran!
> >


> > 20 hours (very little time!) writing disambiguation rules gives
> > substantial improvements.
> I have added the reference to page 
> http://wiki.apertium.org/wiki/Constraint_Grammar (External Links).
> 
> I just want to call the attention to the fact that some of the rules 
> used by these authors could be written in "canonical", CG3-free Apertium 
> as "forbid" rules in .tsx files.
> 
> For instance, the rule
> 
> REMOVE (DET) IF (1C (VFIN));
> 
> corresponds to forbid rules we use in .tsx files (see, e.g. 
> apertium-es-ca.es.tsx) such as:
> 
> <forbid>
> <!-- ... -->
> <label-sequence>
> <label-item label="DETM"/>
> <label-item label="VLEXPFCI"/>
> </label-sequence>
> <!-- ... -->
> </forbid>

Agree.
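
As a rough illustration of what such a forbid rule does, here is a minimal Python sketch. It is not Apertium code: the function name, the toy ambiguity classes and the PRNM label are assumptions made up for the example, and the real tagger is an HMM rather than an explicit path filter. The sketch simply discards every candidate tag path that contains the forbidden label bigram.

```python
from itertools import product

# Forbidden label bigrams, mirroring the <label-sequence> quoted above.
FORBIDDEN = {("DETM", "VLEXPFCI")}

def allowed_paths(ambiguity_classes, forbidden=FORBIDDEN):
    """Enumerate all tag paths and drop those containing a forbidden bigram."""
    paths = []
    for path in product(*ambiguity_classes):
        bigrams = zip(path, path[1:])
        if not any(bigram in forbidden for bigram in bigrams):
            paths.append(path)
    return paths

# Toy input (assumption): the first word is ambiguous between a determiner
# and a pronoun (PRNM is a hypothetical label), the second is a finite verb.
classes = [["DETM", "PRNM"], ["VLEXPFCI"]]
print(allowed_paths(classes))  # [('PRNM', 'VLEXPFCI')]
```

Only adjacent pairs are ever inspected here, which is the restriction discussed further down.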

>   We have also (historically) found that investing some time on .tsx 
> rules improves taggers measurably.

True, but the rules are fairly restrictive, allowing only bigram
contexts. In Atro Voutilainen's "Hand Crafted Rules", he gives numbers
saying that in the (famous) EngCG tagger, 10% of rules have unbounded
contexts, and 21% have a condition that is not a neighbouring word. This
may seem like a low number, but these are exactly the kind of problems
(non-neighbouring words) that we are up against, and that cannot be
taken care of with bigram rules.

$ echo "He very rarely looks that way." | apertium -d . en-es-tagger
^Prpers<prn><subj><p3><m><sg>$ ^very<preadv>$ ^rarely<adv>$
^look<vblex><pri><p3><sg>$ ^that<cnjsub>$ ^way<n><sg>$^.<sent>$^.<sent>$

Taken pairwise, the finite verb + cnjsub, cnjsub + noun and noun + sent
sequences are all fine. The problem is the longer sequence cnjsub + noun
+ sent taken as a whole.
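
The point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set of "fine" bigrams and the trigram constraint are hypothetical, written for this one example, and the tags are reduced to their first symbol. The bad path passes a check that only looks at neighbouring pairs, but is caught once three tags are looked at together.

```python
# Tag path produced for "... looks that way ." (first tag of each reading only).
path = ["vblex", "cnjsub", "n", "sent"]

# Assumption: bigrams we do not want to forbid, since each is fine elsewhere.
FINE_BIGRAMS = {("vblex", "cnjsub"), ("cnjsub", "n"), ("n", "sent")}

# Hypothetical wider-context constraint: cnjsub + noun + sentence boundary.
BAD_TRIGRAMS = {("cnjsub", "n", "sent")}

def ngrams(tags, n):
    """Return all contiguous n-grams of the tag list as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def rejected_by_bigrams(tags):
    """True if some adjacent pair is not in the set of acceptable bigrams."""
    return any(bigram not in FINE_BIGRAMS for bigram in ngrams(tags, 2))

def rejected_by_trigrams(tags):
    """True if some window of three tags matches a bad trigram."""
    return any(trigram in BAD_TRIGRAMS for trigram in ngrams(tags, 3))

print(rejected_by_bigrams(path))   # False: every neighbouring pair is fine
print(rejected_by_trigrams(path))  # True: the three-tag window is the problem
```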

> > Might help us get around tagging errors like:
> >
> > $ echo "Avui no veig el sol." | apertium -d . ca-en-tagger
> > ^Avui<adv>$ ^no<adv>$ ^veure<vblex><pri><p1><sg>$ ^el<det><def><m><sg>$
> > ^sol<adj><m><sg>$^.<sent>$^.<sent>$
> Fran, what would be a reasonable "forbidding" rule here that repairs 
> this error but does not break things somewhere else?

I would write a rule to say:

If there is an ambiguity in the sequence "definite article +
adjective/noun + sentence boundary", choose the "det noun sent" reading.

The case where you can have det + adj + sent is where the adjective is
"nominalised". So if you already have a noun, it is _probably_ better to
choose this.
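
A minimal Python sketch of that rule, assuming a made-up data structure (a list of surface forms, each with a list of readings given as tag tuples) and a hypothetical function name. It is not CG3 or .tsx syntax, just the logic: when a word with both noun and adjective readings sits between an unambiguous definite determiner and a sentence boundary, keep only the noun readings.

```python
def prefer_noun_before_sent(sentence):
    """Apply the 'det noun sent' preference and return a new sentence.

    sentence: list of (surface_form, readings); each reading is a tuple of tags.
    """
    out = [(form, list(readings)) for form, readings in sentence]
    for i in range(1, len(out) - 1):
        prev_readings = out[i - 1][1]
        form, readings = out[i]
        next_readings = out[i + 1][1]
        # Assumption: the determiner has already been disambiguated.
        is_def_det = bool(prev_readings) and all(
            r[:2] == ("det", "def") for r in prev_readings)
        at_boundary = bool(next_readings) and all(
            r[0] == "sent" for r in next_readings)
        parts_of_speech = {r[0] for r in readings}
        if is_def_det and at_boundary and {"n", "adj"} <= parts_of_speech:
            out[i] = (form, [r for r in readings if r[0] == "n"])
    return out

sentence = [
    ("el", [("det", "def", "m", "sg")]),
    ("sol", [("n", "m", "sg"), ("adj", "m", "sg")]),
    (".", [("sent",)]),
]
print(prefer_noun_before_sent(sentence)[1])  # ('sol', [('n', 'm', 'sg')])
```

Because the rule only fires when a noun reading is available, a nominalised adjective with no competing noun reading is left alone.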

The other option would be to make a lexical rule for "sol". I can come
up with examples where this would be wrong (e.g. (?)"Hi ha més sols que
brillen aixina? No és el sol.") but these are a bit far-fetched.

In the Catalan Wikipedia corpus, the only examples of "el sol ." are of
the Sun.

If you can find a corpus where "el sol ." as "the only one ." exceeds
"the Sun ." I would be interested to see it :)

> > $ echo "Why does she do that?" | apertium -d . en-ca-tagger
> > ^Why<adv><itg>$ ^do<vbdo><pri><p3><sg>$ ^prpers<prn><subj><p3><f><sg>$
> > ^do<vbdo><pres>$ ^that<cnjsub>$^?<sent>$^.<sent>$
> 
> I think this could easily be dealt with in "pure", "canonical" Apertium 
> using a simple forbid rule in the .tsx file. The fact that booboos like 
> this one pass on to the transfer file is a clear indication that the 
> .tsx file in apertium-en-ca needs love, rather than justifying the need 
> for introducing a non-canonical CG3 module. I have also added a quick 
> section in http://wiki.apertium.org/wiki/Constraint_Grammar to that effect.

Great. :)

> You will notice that I make a strong point of not considering CG3 part 
> of canonical or mainstream Apertium (I hope you grant me the right to 
> show a reluctant position here as a creator of the original Apertium!). 

I also make that point. And I encourage work on development of
replacements.

> I make a similar point with respect to HFST, which is clearly 
> non-canonical Apertium. I believe that using CG3 and HFST has 
> effectively hindered reasonable usages of apertium-tagger and perhaps 
> its development, 

Part of the problem is that there is no active development of
apertium-tagger: bugs take a long time to find and fix, and no
improvement work has been done since 2009. On the other hand, CG3 has
weekly commits by an active developer, and bugs are fixed in days/hours
instead of weeks/months.

I really think it would be nice to have a finite-state based replacement
to CG3 in Apertium. But until we have one, if people want to fix errors
in tagging in a traceable manner, I'll recommend CG3.

> and has also moved all attention away from improving 
> the .metadix format, which has divergent dialects in different language 
> pairs.
> 
> Call me conservative and radical, but I would have rather seen some 
> development of apertium-tagger and the metadix format, 

We basically don't have the developer time. The HFST group has 4-5
active developers. lttoolbox has around 0.5. CG3 has one very active
developer, apertium-tagger perhaps 0.1. 

Improving our own tools just hasn't been a (research) objective of the
Apertium (research) community. The idea (as I understand it) has been
more "making the best of what we have".

> instead of having to spend a long hour installing third-party tools such as
> OpenFST or vislcg3 on my machine before I can compile a language pair that
> requires such a Frankenstein configuration, and which would probably not
> need them if we had developed the core Apertium instead of patching
> around it.

See above wrt. developer time. But having said that, the OpenFST+HFST
+Foma behemoth takes an hour. Installing vislcg3 is fairly painless and
done in 10 minutes or so. 

> Currently some language pairs use two different formats for
> tagger decisions and two different formats for dictionaries. This, in my 
> opinion, is far from being ideal, and may be discouraging some 
> Apertiumers. 

In my experience, it is an encouragement. For developers I've spoken
to -- admittedly typically linguists / language enthusiasts -- the
benefits of installing vislcg3 (traceable rules, not having to train the
tagger, >2-gram contexts, etc.) vastly outweigh the 10 minutes it takes
to install.

Furthermore, more discouraging for potential developers would be having
to write a morphological dictionary for their language without the
appropriate tools. 

> I am currently helping develop apertium-eng-kaz with three 
> Kazakh students and the complexity shown by this module makes it harder 
> than I thought to explain.

Which module ? The CG or the HFST ? Or both ? 

> In the past, stubbornly sticking to some design tenets such as "vintage" 
> 70's Unix-style pipelines and text formats has, in my opinion, 
> contributed to having a lean, clear, homogeneous engine.

But in many cases non-homogeneous language pair data. The metadix format
for example. Explaining that is easily as tricky as explaining vislcg3.

>  One success of 
> that is the development of multi-level transfer, with all its defects. 
> That's why I will stubbornly defend canonicality!
> 
> I hope you get the point.

Yes definitely. And I also defend canonicality, but at the same time I
want to offer the best and most productive tools to apertiumers. 

http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

* Rule-based finite-state disambiguation (currently Hrvoje is working on
it) 
* Flag diacritics in lttoolbox 

Both of these projects are intended to improve Apertium programs to make
external modules unnecessary.

Fran

