On Sat, 16 June 2012 at 16:48 +0200, Mikel Forcada wrote:
> Thanks Fran!
>
> > 20 hours (very little time!) writing disambiguation rules gives
> > substantial improvements.
>
> I have added the reference to page
> http://wiki.apertium.org/wiki/Constraint_Grammar (External Links).
>
> I just want to call attention to the fact that some of the rules
> used by these authors could be written in "canonical", CG3-free Apertium
> as "forbid" rules in .tsx files.
>
> For instance, the rule
>
> REMOVE (DET) IF (1C (VFIN));
>
> corresponds to forbid rules we use in .tsx files (see, e.g.
> apertium-es-ca.es.tsx) such as:
>
> <forbid>
>   <!-- ... -->
>   <label-sequence>
>     <label-item label="DETM"/>
>     <label-item label="VLEXPFCI"/>
>   </label-sequence>
>   <!-- ... -->
> </forbid>

Agree.

> We have also (historically) found that investing some time on .tsx
> rules improves taggers measurably.

True, but the rules are fairly restrictive, allowing only bigram
contexts. In Atro Voutilainen's "Hand Crafted Rules", he gives numbers
saying that in the (famous) EngCG tagger, 10% of rules have unbounded
contexts, and 21% have a condition that is not a neighbouring word. This
may seem like a low number, but these are exactly the kind of problems
(non-neighbouring words) that we are up against, and that cannot be
taken care of with bigram rules.

$ echo "He very rarely looks that way." | apertium -d . en-es-tagger
^Prpers<prn><subj><p3><m><sg>$ ^very<preadv>$ ^rarely<adv>$
^look<vblex><pri><p3><sg>$ ^that<cnjsub>$ ^way<n><sg>$^.<sent>$^.<sent>$

Taken pairwise, the finite verb + cnjsub, cnjsub + noun and noun + sent
bigrams are all fine. The problem is the three-tag sequence cnjsub +
noun + sent, which is outside the reach of a bigram rule.

> > Might help us get around tagging errors like:
> >
> > $ echo "Avui no veig el sol." | apertium -d . ca-en-tagger
> > ^Avui<adv>$ ^no<adv>$ ^veure<vblex><pri><p1><sg>$ ^el<det><def><m><sg>$
> > ^sol<adj><m><sg>$^.<sent>$^.<sent>$
>
> Fran, what would be a reasonable "forbidding" rule here that repairs
> this error but does not break things somewhere else?

I would write a rule to say:

If there is an adjective/noun ambiguity in the sequence "definite
article + adjective/noun + sentence boundary", choose the "det noun
sent" reading.

The case where you can have det + adj + sent is where the adjective is
"nominalised". So if a noun reading is already available, it is
_probably_ better to choose it. The other option would be to make a
lexical rule for "sol".

I can come up with examples where this would be wrong (e.g. (?)"Hi ha
més sols que brillen aixina ? No és el sol.") but these are a bit
contrived. In the Catalan Wikipedia corpus, the only examples of
"el sol ." are of the Sun. If you can find a corpus where "el sol ." as
"the only one ." exceeds "the Sun ." I would be interested to see it :)

> > $ echo "Why does she do that?" | apertium -d . en-ca-tagger
> > ^Why<adv><itg>$ ^do<vbdo><pri><p3><sg>$ ^prpers<prn><subj><p3><f><sg>$
> > ^do<vbdo><pres>$ ^that<cnjsub>$^?<sent>$^.<sent>$
>
> I think this could easily be dealt with in "pure", "canonical" Apertium
> using a simple forbid rule in the .tsx file. The fact that booboos like
> this one pass on to the transfer file is a clear indication that the
> .tsx file in apertium-en-ca needs love, rather than justifying the need
> for introducing a non-canonical CG3 module. I have also added a quick
> section in http://wiki.apertium.org/wiki/Constraint_Grammar to that effect.

Great. :)

> You will notice that I make a strong point of not considering CG3 part
> of canonical or mainstream Apertium (I hope you grant me the right to
> show a reluctant position here as a creator of the original Apertium!).

I also make that point. And I encourage work on development of
replacements.

> I make a similar point with respect to HFST, which is clearly
> non-canonical Apertium.
> I believe that using CG3 and HFST has
> effectively hindered reasonable usages of apertium-tagger and perhaps
> its development,

Part of the problem is that there is no development on apertium-tagger,
bugs take a long time to find and fix, and no improvement work has been
done since 2009. On the other hand, CG3 has weekly commits by an active
developer, and bugs are fixed in days/hours instead of weeks/months.

I really think it would be nice to have a finite-state based replacement
for CG3 in Apertium. But until we have one, if people want to fix errors
in tagging in a traceable manner, I'll recommend CG3.

> and has also moved all attention away from improving
> the .metadix format, which has divergent dialects in different language
> pairs.
>
> Call me conservative and radical, but I would have rather seen some
> development of apertium-tagger and the metadix format,

We basically don't have the developer time. The HFST group has 4-5
active developers. lttoolbox has around 0.5. CG3 has one very active
developer, apertium-tagger perhaps 0.1. Improving our own tools just
hasn't been a (research) objective of the Apertium (research)
community. The idea (as I understand it) has been more about "making
the best of what we have".

> instead of having to spend a long hour installing third-party tools such as
> OpenFST or
> vislcg3 on my machine before I can compile a language pair that requires
> such a Frankenstein configuration, and which would probably not
> need them if we had developed the core Apertium instead of patching
> around it.

See above wrt. developer time. But having said that, it is the
OpenFST+HFST+Foma behemoth that takes an hour. Installing vislcg3 is
fairly painless and done in 10 minutes or so.

> Currently some language pairs use two different formats for
> tagger decisions and two different formats for dictionaries. This, in my
> opinion, is far from being ideal, and may be discouraging some
> Apertiumers.

From my experience, it is an encouragement.
For developers I've spoken to (admittedly typically linguists /
language enthusiasts), the benefits of installing vislcg3 (traceable
rules, not having to train the tagger, contexts longer than bigrams,
etc.) vastly outweigh the 10 minutes it takes to install. Furthermore,
having to write a morphological dictionary for their language without
the appropriate tools would be more discouraging for potential
developers.

> I am currently helping develop apertium-eng-kaz with three
> Kazakh students and the complexity shown by this module makes it harder
> than I thought to explain.

Which module? The CG or the HFST? Or both?

> In the past, stubbornly sticking to some design tenets such as "vintage"
> 70's Unix-style pipelines and text formats has, in my opinion,
> contributed to having a lean, clear, homogeneous engine.

But in many cases non-homogeneous language pair data. The metadix format
for example. Explaining that is easily as tricky as explaining vislcg3.

> One success of
> that is the development of multi-level transfer, with all its defects.
> That's why I will stubbornly defend canonicality!
>
> I hope you get the point.

Yes definitely. And I also defend canonicality, but at the same time I
want to offer the best and most productive tools to apertiumers.

http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

* Rule-based finite-state disambiguation (currently Hrvoje is working on it)
* Flag diacritics in lttoolbox

Both of these projects are intended to improve Apertium programs to make
external modules unnecessary.

Fran
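
P.S. To make the "det + adj/noun + sent" rule described above a bit more
concrete, here is a minimal sketch in plain Python. It is not Apertium,
.tsx or CG3 code: the token representation (a surface form plus a list
of tag tuples) and the function name prefer_noun_before_sent are
assumptions made up for this illustration. It also assumes the
determiner and the sentence boundary are already unambiguous, which is a
simplification.

```python
def prefer_noun_before_sent(tokens):
    """Keep only noun readings of a word that is ambiguous between adj and n
    when it sits between a definite determiner and a sentence boundary.

    `tokens` is a list of (surface, readings) pairs; each reading is a tuple
    of Apertium-style tags, e.g. ("det", "def", "m", "sg"). This layout is a
    hypothetical one chosen for the example.
    """
    result = [(surface, list(readings)) for surface, readings in tokens]
    for i in range(1, len(result) - 1):
        prev_readings = result[i - 1][1]
        surface, readings = result[i]
        next_readings = result[i + 1][1]

        # Left context: every reading of the previous word is a definite det.
        prev_is_def_det = bool(prev_readings) and all(
            r[0] == "det" and "def" in r for r in prev_readings
        )
        # Right context: the next token is unambiguously a sentence boundary.
        next_is_sent = bool(next_readings) and all(
            r[0] == "sent" for r in next_readings
        )
        parts_of_speech = {r[0] for r in readings}

        # The rule looks at three positions at once, which is exactly what a
        # bigram-only rule cannot do.
        if prev_is_def_det and next_is_sent and {"n", "adj"} <= parts_of_speech:
            result[i] = (surface, [r for r in readings if r[0] == "n"])
    return result


sentence = [
    ("el", [("det", "def", "m", "sg")]),
    ("sol", [("adj", "m", "sg"), ("n", "m", "sg")]),
    (".", [("sent",)]),
]
print(prefer_noun_before_sent(sentence)[1])
```

When "sol" is not followed by a sentence boundary, the sketch leaves
both readings in place, so the decision is left to whatever else does
the disambiguation.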
