Hi all,

To allow for better disambiguation and write better rules, I need to include a valency dictionary in LT. These are dictionaries that specify which grammatical cases or prepositions go with which verbs, etc. Such resources exist for many of the languages we support. Using them, we could enrich POS tag disambiguation a lot (I'm using a horribly long regular expression right now instead of a dictionary, for example) and write a lot of important rules.
The obvious choice for representing the dictionary (which is available for Polish under a fairly liberal license) is a finite-state lexicon of the kind we normally use for taggers. The dictionary will be applied after tagging, because the valency lookup requires POS tag + lexeme information.

In Polish, the entries look like this:

absurdalny: pewny: : : : {prepnp(dla,gen)}
absurdalny: pewny: : : : {prepnp(w,loc)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}

But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are paragraph-based:

VAL$ abaisser: P0 P1
VTYPE$ predicator simple
VERB$ ABAISSER/abaisser
NUM$ 10
EG$ il faudra abaisser la persienne
TR_DU$ laten zakken, neerhalen, neerlaten, doen dalen
TR_EN$ let down, lower
FRAME$ subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
P0$ (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci, ceux-ci
P1$ que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
RP$ passif être, se passif
AUX$ avoir

VAL$ abaisser: P0 (P1)
VTYPE$ predicator simple
VERB$ ABAISSER/abaisser
NUM$ 20
EG$ il a raconté cette anecdote pour m'abaisser
TR_DU$ vernederen, kleineren
TR_EN$ humiliate
FRAME$ subj:pron|n:[hum], ?obj:pron|n:[hum]
P0$ (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci, ceux-ci, (ça(de_inf))
P1$ 0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci, l'un l'autre
RP$ passif être, se faire passif
AUX$ avoir

I would also add a new list of valency attributes to every AnalyzedToken, simply as a string value (parsing the string would be overkill, because different languages may encode valency information differently), with appropriate getters and setters (see the rough sketch at the end of this message).

To use this, we will need some additional attributes on the XML elements token and exception. The following notation seems reasonable:

<token postag="verb" valence="WHATEVER_STRING"/>

I think matching valence should, by default, use regular expressions. I'm not sure if negation is needed.

Alternatively, we could go for attribute-value pairs, but the problem would be how to make this language-independent and express it in our XML notation. The easiest way would then probably be to define types and their ids, just as I did for unification. For French, this could look like:

<token>abaisser<valency id="AUX">avoir</valency><valency id="VTYPE">predicator simple</valency></token>

Of course, the textual values of <valency> could be made into regexes and negated. For this, we would need to write a key-value parser for each particular valency dictionary. But this would definitely speed up matching and would make writing rules much easier.

Best
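P.S. To make this a bit more concrete, here is a rough Java sketch of what I have in mind. The class, field and method names below are placeholders only, not the final API (the real change would go into AnalyzedToken and the pattern rule matcher):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ValencySketch {

  // What the AnalyzedToken extension could carry: the raw, language-specific
  // valency string, with a plain getter/setter and no further parsing.
  static class TokenValency {
    private String valency;

    String getValency() { return valency; }
    void setValency(String valency) { this.valency = valency; }
  }

  // Evaluating the proposed valence="..." attribute as a regular expression.
  static boolean valenceMatches(TokenValency token, String valenceRegex) {
    return token.getValency() != null
        && Pattern.compile(valenceRegex).matcher(token.getValency()).find();
  }

  // A minimal key-value split for one DICOVALENCE-style paragraph entry,
  // keyed on the "FIELD$" prefix (VTYPE$, FRAME$, AUX$, ...).
  static Map<String, String> parseDicovalenceEntry(String paragraph) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (String line : paragraph.split("\n")) {
      int sep = line.indexOf('$');
      if (sep > 0) {
        fields.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    TokenValency token = new TokenValency();
    token.setValency("{prepnp(dla,gen)}+{cp(że)}");                    // Polish entry from above
    System.out.println(valenceMatches(token, "prepnp\\(dla,gen\\)"));  // true

    String entry = "VTYPE$ predicator simple\nAUX$ avoir";
    System.out.println(parseDicovalenceEntry(entry).get("AUX"));       // avoir
  }
}

The parser part is only there to show that a per-dictionary key-value reader is a small amount of code; the dictionary itself would, as described above, live in a finite-state lexicon.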